BANDITS is a Bayesian hierarchical method to perform differential splicing via differential transcript usage (DTU). BANDITS uses a hierarchical structure, via a Dirichlet-multinomial model, to explicitly model the over-dispersion between replicates and allowing for sample-specific transcript relative abundance (i.e., the proportions). More mathematically, consider a gene with K transcripts with transcript level counts Y = (Y1, …, YK); we assume that Y ∼ DM(π1, …, πK, δ), where DM denotes the Dirichlet-multinomial distribution, π1, …, πK indicate the relative abundance of transcripts 1, …, K, and δ represents the precision parameter, modelling the degree of over-dispersion between samples.
We input the equivalence classes and respective counts, where the equivalence classes represent the group of transcripts reads are compatible with. The method is embedded in a Bayesian hierarchical framework, where the posterior densities of the parameters are inferred via Markov chain Monte Carlo (MCMC) techniques. The allocation of each RNA-seq read to its transcript of origin is treated as a latent variable and also sampled in the MCMC. To test for DTU, we compare the average transcript relative abundance between two or more conditions. A statistical test is performed, both, at the gene- and transcript-level, allowing scientists to investigate what specific transcripts are differentially used in significant genes.
To access the R code used in the vignettes, type:
Questions relative to BANDITS should be reported as a new issue at BugReports.
To cite BANDITS, type:
## To cite BANDITS in publications use:
##
## Simone Tiberi and Mark D. Robinson (2020). BANDITS: Bayesian
## differential splicing accounting for sample-to-sample variability and
## mapping uncertainty. Genome Biology, 21 (69). URL
## https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {BANDITS: Bayesian differential splicing accounting for sample-to-sample variability and mapping uncertainty},
## author = {Simone Tiberi and Mark D. Robinson},
## eprint = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8},
## journal = {Genome Biology},
## volume = {21},
## number = {69},
## year = {2020},
## doi = {10.1186/s13059-020-01967-8},
## url = {https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01967-8},
## publisher = {Springer},
## }
BANDITS
is available on Bioconductor and
can be installed with the command:
To install the latest development version of the package from github,
use devtools
(available here):
To install the package jointly with its vignette remove
--no-build-vignettes
from build_opts
:
The package inputs the equivalence classes and respective counts. These can be obtained by aligning reads either directly to a reference transcriptome with pseudo-alignmers, via salmon (Patro et al. 2017) or kallisto (Bray et al. 2016), or to a reference genome with splice-aware genome alignment algorithms, via STAR (Dobin et al. 2013), and checking the transcripts compatible with each genome alignment with salmon.
NOTE: when using salmon, use the option
--dumpEq
to obtain the equivalence classes, when using
STAR, use the option --quantMode TranscriptomeSAM
to obtain alignments translated into transcript coordinates, and when
using kallisto, run both kallisto quant
and
kallisto pseudo
to obtain the transcript estimated counts
and equivalence classes, respectively.
The file README provides three pipelines for aligning reads with salmon, kallisto and STAR.
Further to the equivalence classes, our tool requires the matching between gene and transcript ids, compatible with the genome or transcriptome used to align reads. There are multiple ways to compute a gene-transcript compatibility matrix; below we show two examples to create it, accoriding to whether reads are aligned with a genome and transcriptome aligner. Bear in mind that the example code below will not work on any given gtf and fasta file and adjustments might be needed; alternative approaches to compute gene-transcript matchings are illustrated in tximport (Soneson, Love, and Robinson 2015) vignette.
If the reads are aligned to the genome first (with STAR), we can compute a gene-transcript association from the gtf file via GenomicFeatures (Lawrence et al. 2013) library. Here we provide an example code:
suppressMessages(library(GenomicFeatures))
gtf_file = system.file("extdata","GTF_files","Aedes_aegypti.partial.gtf",
package="GenomicFeatures")
tx = makeTxDbFromGFF(gtf_file)
ss = unlist(transcriptsBy(tx, by="gene"))
gene_tr_id_gtf = data.frame(gene_id = names(ss), transcript_id = ss$tx_name )
# remove eventual NA's:
gene_tr_id_gtf = gene_tr_id_gtf[ rowSums( is.na(gene_tr_id_gtf)) == 0, ]
# remove eventual duplicated rows:
gene_tr_id_gtf = unique(gene_tr_id_gtf)
If the reads are aligned directly to the transcriptome (with salmon or kallisto), we compute a gene-transcript association from the cDNA fasta file via Biostrings (Pagès et al. 2019) library. Here we provide an example code:
suppressMessages(library(Biostrings))
data_dir = system.file("extdata", package = "BANDITS")
fasta = readDNAStringSet(file.path(data_dir, "Homo_sapiens.GRCh38.cdna.all.1.1.10M.fa.gz"))
ss = strsplit(names(fasta), " ")
gene_tr_id_fasta = data.frame(gene_id = gsub("gene:", "", sapply(ss, .subset, 4)),
transcript_id = sapply(ss, .subset, 1))
# remove eventual NA's
gene_tr_id_fasta = gene_tr_id_fasta[ rowSums( is.na(gene_tr_id_fasta)) == 0, ]
# remove eventual duplicated rows:
gene_tr_id_fasta = unique(gene_tr_id_fasta)
Load BANDITS
Specify the directory of the data (internal in the package).
We need a matrix or data.frame containing the matching between gene and transcript identifiers. The file “alignment and gene-transcript matching.txt” shows how to create such a file from a gtf (in case of genome alignment) or from a fasta file (in case of transcript alignment).
Load the precomputed gene-transcript matching.
gene_tr_id
is a data.frame (but a matrix is also accepted)
containing the transcripts ids on the second column and the
corresponding gene ids on the first column.
## gene_id transcript_id
## 2 ENSG00000223972 ENST00000456328
## 6 ENSG00000223972 ENST00000450305
## 14 ENSG00000227232 ENST00000488147
## 27 ENSG00000278267 ENST00000619216
## 30 ENSG00000243485 ENST00000473358
## 34 ENSG00000243485 ENST00000469289
Specify the directory of the transcript level estimated counts.
sample_names = paste0("sample", seq_len(4))
quant_files = file.path(data_dir, "STAR-salmon", sample_names, "quant.sf")
file.exists(quant_files)
## [1] TRUE TRUE TRUE TRUE
Load the transcript level estimated counts via tximport.
## reading in files with read.delim (install 'readr' package for speed up)
## 1 2 3 4
## [,1] [,2] [,3] [,4]
## ENST00000456328 5.00000 2.00000 8.00000 2.00000
## ENST00000450305 0.00000 0.00000 0.00000 0.00000
## ENST00000488147 47.39685 43.09507 65.84316 31.40446
## ENST00000619216 0.00000 0.00000 0.00000 0.00000
## ENST00000473358 1.00000 1.00000 1.00000 1.00000
## ENST00000469289 0.00000 0.00000 1.00000 0.00000
We define the design of the study: in our case we have 2 groups, that we call “A” and “B” of 2 samples each.
## sample_id group
## 1 sample1 A
## 2 sample2 A
## 3 sample3 B
## 4 sample4 B
The groups are defined in:
## NULL
Here we consider a two-group comparison, however BANDITS also allows to compare more than 2 groups.
Before loading the data, we also compute, via
eff_len_compute
, the median effective length of each
transcript (the median is computed with respect to the samples).
## ENST00000456328 ENST00000450305 ENST00000488147 ENST00000619216 ENST00000473358
## 1503.1350 478.2580 1197.1350 3.0210 558.2165
## ENST00000469289
## 381.3605
Pre-filtering lowly abundant transcripts was found to improve performance of differential splicing methods; furthermore, by simplifying the inferential problem, it also leads to a significant reduction in the computational cost of our method. Albeit not strictly required, we highly suggest to pre-filter transcripts. Here, we use a mild filtering cutoff by remove transcripts whose average relative abundance is below 0.01. For the filtering step, we use transcript-level estimated counts to compute the average relative abundance.
Compute the transcripts to keep, by filtering lowly abundant
transcripts. Here min_transcript_proportion = 0.01
will
remove transctipts with estimated mean relative abundance below 0.01. We
further impose constraints on the total abundance:
min_transcript_counts = 10
indicates that each transcript
must have at least 10 estimated counts (adding counts from all samples),
and min_gene_counts = 20
specifies that each gene should
have at least 20 estimated counts (adding counts from all samples).
While running, filter_transcripts
prints on screen the
percentage of transcripts kept after filtering.
transcripts_to_keep = filter_transcripts(gene_to_transcript = gene_tr_id,
transcript_counts = counts,
min_transcript_proportion = 0.01,
min_transcript_counts = 10,
min_gene_counts = 20)
## After filtering, 12.24% of transcripts are kept
## [1] "ENST00000377577" "ENST00000358779" "ENST00000341426" "ENST00000341991"
## [5] "ENST00000377403" "ENST00000054666"
Below we illustrate how to load the equivalence classes computed with
salmon
or kallisto
.
We specify the path to the equivalence classes computed by
salmon
in equiv_classes_files
.
equiv_classes_files = file.path(data_dir, "STAR-salmon", sample_names,
"aux_info", "eq_classes.txt")
file.exists(equiv_classes_files)
## [1] TRUE TRUE TRUE TRUE
Warning: the sample names in equiv_classes_files
must
have the same order as those in the design object, containted in
samples_design
.
## [1] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/STAR-salmon/sample1/aux_info/eq_classes.txt"
## [2] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/STAR-salmon/sample2/aux_info/eq_classes.txt"
## [3] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/STAR-salmon/sample3/aux_info/eq_classes.txt"
## [4] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/STAR-salmon/sample4/aux_info/eq_classes.txt"
## [1] "sample1" "sample2" "sample3" "sample4"
We then import the equivalence classes and respective counts, and
create a BANDITS_data
object via create_data
.
When providing transcripts_to_keep
, the function filters
internally transcripts that are not in the vector. When filtering
transripts, we suggest to parallelize computations and use one core per
sample (i.e., n_cores = length(path_to_eq_classes)
). Since
at least 2 transcripts are necessary to study differential splicing,
genes with a single transcript are not analyzed.
In our example data, reads were aligned to the genome with
STAR, and salmon was then used to compute the
equivalence classes (and quantify transcript abundance) on the aligned
reads; therefore we set salmon_or_kallisto = "salmon"
.
input_data = create_data(salmon_or_kallisto = "salmon",
gene_to_transcript = gene_tr_id,
salmon_path_to_eq_classes = equiv_classes_files,
eff_len = eff_len,
n_cores = 2,
transcripts_to_keep = transcripts_to_keep)
## Data has been loaded
## Max 11 transcripts per group
## Max 5 genes per group
If transcripts pre-filtering is not wanted, do not specify
transcripts_to_keep
parameter.
After loading the data, with
filter_genes(data, min_counts_per_gene = 20)
, we remove
genes with less than 20 counts overall (i.e., considering all
equivalence classes across all samples).
## Initial number of genes: 40; number of selected genes: 40
When reads have been aligned with kallisto
, we proceed
in a very similar way as above.
We specify the path to the equivalence classes
(kallisto_equiv_classes
) and respective counts
(kallisto_equiv_counts
) computed by
kallisto
.
kallisto_equiv_classes = file.path(data_dir, "kallisto", sample_names, "pseudoalignments.ec")
kallisto_equiv_counts = file.path(data_dir, "kallisto", sample_names, "pseudoalignments.tsv")
file.exists(kallisto_equiv_classes); file.exists(kallisto_equiv_counts)
## [1] TRUE TRUE TRUE TRUE
## [1] TRUE TRUE TRUE TRUE
Warning: as above, the sample names in
kallisto_equiv_classes
and
kallisto_equiv_classes
must have the same order as those in
the design object, containted in samples_design
.
## [1] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample1/pseudoalignments.ec"
## [2] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample2/pseudoalignments.ec"
## [3] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample3/pseudoalignments.ec"
## [4] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample4/pseudoalignments.ec"
## [1] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample1/pseudoalignments.tsv"
## [2] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample2/pseudoalignments.tsv"
## [3] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample3/pseudoalignments.tsv"
## [4] "/tmp/RtmpAODvWl/Rinst1a2c11219bc2/BANDITS/extdata/kallisto/sample4/pseudoalignments.tsv"
## [1] "sample1" "sample2" "sample3" "sample4"
As above, we import the equivalence classes and respective counts,
and create a BANDITS_data
object via
create_data
.
input_data_2 = create_data(salmon_or_kallisto = "kallisto",
gene_to_transcript = gene_tr_id,
kallisto_equiv_classes = kallisto_equiv_classes,
kallisto_equiv_counts = kallisto_equiv_counts,
kallisto_counts = counts,
eff_len = eff_len, n_cores = 2,
transcripts_to_keep = transcripts_to_keep)
## Data has been loaded
## Max 69 transcripts per group
## Max 29 genes per group
## A 'BANDITS_data' object with 4 samples and 40 genes.
If transcripts pre-filtering is not wanted, do not specify
transcripts_to_keep
parameter.
After loading the data, with
filter_genes(data, min_counts_per_gene = 20)
, we remove
genes with less than 20 counts overall (i.e., considering all
equivalence classes across all samples).
## Initial number of genes: 40; number of selected genes: 40
In this Section we illustrate how to formulate an informative prior for the precision parameter (i.e., the Dirichlet-Multinomial parameter modelling the degree of over-dispersion between samples). Note that this is an optional, yet highly recommended, step.
The prior_precision
function builds on top of
DRIMSeq’s (Nowicka and Robinson
2016) DRIMSeq::dmPrecision
function which provides
genewise estimates of the precision parameter. Use the same filtering
criteria as in create_data
, by choosing the same argument
for transcripts_to_keep
. If transcript pre-filtering is not
performed, leave transcripts_to_keep
unspecified.
set.seed(61217)
precision = prior_precision(gene_to_transcript = gene_tr_id,
transcript_counts = counts, n_cores = 2,
transcripts_to_keep = transcripts_to_keep)
## Estimating gene-wise precision parameters
## Estimation completed
The first element of the result contains the mean and standard deviation of the log-precision estimates.
## [1] 3.658696 3.622948
Plot the histogram of the genewise log-precision estimates. The black solid line represents the normally distributed prior distribution for the log-precision parameter.
With test_DTU
, we jointly run the MCMC algorithm, to
infer the posterior distributions of the parameters, and test for DTU.
mean_log_delta
and sd_log_delta
represent the
mean and standard deviation of the informative prior for the
log-precision parameter, if available. If an informative prior was not
computed, leave mean_log_delta
and
sd_log_delta
fields unspecified.
R
and burn_in
represent the length of the
MCMC chain (excluding the burn-in) and the length of the burn-in (i.e.,
the initial portion of the chain which is discarded). For genes that are
analyzed together (because one or more reads are compatible with
multiple genes), R
and burn_in
are doubled to
face the increased complexity of the inferential problem. The method
requires at least R = 10^4
and
burn_in = 2*10^3
. Albeit no difference was observed in
simulation studies when increasing these numbers, we encourage users to
possibly use higher values (e.g., double) if the computational time
allows it.
A convergence diagnostic is used to test if the posterior chains are stationary and to determine if a further fraction of the chain should be discarded as burn-in. If convergence is not reached, the chain is discarded and a second chain is run; if convergence is again not reached, a third chain is run: if three consecutive chains fail to converge, the respective gene is not tested for DTU.
It is highly suggested to speed up computations by parallelizing the
method and specifying the number of parallel threads via the
n_cores
parameter. Before running the MCMC, we set the seed
for the random number generation in R.
For genes with a p.value below 0.1, test_DTU
runs a
second independent MCMC chain, merges it with the first one and tests
again for DTU based on the aggregated chain.
The method can technically be run with a single observation per group, however 2 in each group should be regarded as the very minimum sample size.
We run the DTU method. group_col_name
indicates the name
of the column of samples_design
containing the group id of
each sample (by default group_col_name = "group"
).
set.seed(61217)
results = test_DTU(BANDITS_data = input_data,
precision = precision$prior,
samples_design = samples_design,
group_col_name = "group",
R = 10^4, burn_in = 2*10^3, n_cores = 2,
gene_to_transcript = gene_tr_id)
## Starting the MCMC
## MCMC completed
## Returning results
The output of test_DTU
is a BANDITS_test
object; results are stored in 3 data.frame
objects
containing gene level results, transcript level results and convergence
output. All results are sorted, by default, according to the
significance of the gene level test.
To read a full description of the output from test_DTU
,
see help(BANDITS_test)
.
## A 'BANDITS_test' object, with 36 gene and 114 transcript level results.
Functions top_genes
, top_transcripts
and
convergence
can be used to access gene level results,
transcript level results and convergence output, respectively.
Visualize the most significant Genes, sorted by gene level significance.
## Gene_id p.values adj.p.values p.values_inverted
## 1 ENSG00000162585 9.132076e-05 0.003287547 9.132076e-05
## 2 ENSG00000197530 1.491652e-02 0.268497286 1.491652e-02
## 3 ENSG00000157870 2.467498e-02 0.296099768 2.467498e-02
## 4 ENSG00000160087 9.415837e-02 0.847425318 9.415837e-02
## 5 ENSG00000008130 1.364419e-01 0.851547520 1.364419e-01
## 6 ENSG00000224870 1.523895e-01 0.851547520 1.523895e-01
## adj.p.values_inverted DTU_measure Mean log-prec A Mean log-prec B
## 1 0.003287547 1.3291221 3.864902 5.115116
## 2 0.268497286 0.6243606 3.450269 5.575319
## 3 0.296099768 0.7229124 5.319788 3.295278
## 4 0.847425318 0.7565747 4.332407 3.276718
## 5 0.851547520 0.7482182 4.455511 3.510488
## 6 0.851547520 0.8338453 4.830025 3.853537
## SD log-prec A SD log-prec B
## 1 1.402252 2.293613
## 2 1.049785 1.911045
## 3 2.247572 3.011678
## 4 1.622281 1.464809
## 5 3.213585 2.063185
## 6 2.715598 3.100057
Alternatively, gene-level results can also be sorted according to “DTU_measure”, which is a measure of the strength of the change between average relative abundances of the two groups.
## Gene_id p.values adj.p.values p.values_inverted
## 1 ENSG00000162585 9.132076e-05 0.003287547 9.132076e-05
## 6 ENSG00000224870 1.523895e-01 0.851547520 1.523895e-01
## 4 ENSG00000160087 9.415837e-02 0.847425318 9.415837e-02
## 5 ENSG00000008130 1.364419e-01 0.851547520 1.364419e-01
## 3 ENSG00000157870 2.467498e-02 0.296099768 2.467498e-02
## 2 ENSG00000197530 1.491652e-02 0.268497286 1.491652e-02
## adj.p.values_inverted DTU_measure Mean log-prec A Mean log-prec B
## 1 0.003287547 1.3291221 3.864902 5.115116
## 6 0.851547520 0.8338453 4.830025 3.853537
## 4 0.847425318 0.7565747 4.332407 3.276718
## 5 0.851547520 0.7482182 4.455511 3.510488
## 3 0.296099768 0.7229124 5.319788 3.295278
## 2 0.268497286 0.6243606 3.450269 5.575319
## SD log-prec A SD log-prec B
## 1 1.402252 2.293613
## 6 2.715598 3.100057
## 4 1.622281 1.464809
## 5 3.213585 2.063185
## 3 2.247572 3.011678
## 2 1.049785 1.911045
Visualize the most significant transcripts, sorted by transcript level significance.
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 3.861658e-06 0.000440229 0.7209274
## 2 ENSG00000162585 ENST00000476803 1.083969e-04 0.006178621 0.7830943
## 10 ENSG00000157870 ENST00000419916 7.710927e-03 0.293015229 0.4790838
## 5 ENSG00000197530 ENST00000514234 2.281786e-02 0.502429907 0.5440071
## 6 ENSG00000197530 ENST00000487053 2.537921e-02 0.502429907 0.2681072
## 30 ENSG00000131584 ENST00000492936 2.644368e-02 0.502429907 0.2790023
## Max_Gene_Tr.Adj.p.val Mean A Mean B SD A SD B
## 1 0.003287547 0.7209274 0.04278565 0.12135398 0.09376301
## 2 0.006178621 0.1321139 0.78309429 0.11263834 0.14019984
## 10 0.296099768 0.4790838 0.03315788 0.14799243 0.07341999
## 5 0.502429907 0.1724052 0.54400706 0.09319581 0.13862266
## 6 0.502429907 0.2681072 0.01534846 0.10035926 0.04941305
## 30 0.987793066 0.2512082 0.08663353 0.05229618 0.05471718
Visualize the convergence output for the most significant genes, sorted by gene level significance.
## Gene_id converged burn_in
## 1 ENSG00000162585 TRUE 0
## 2 ENSG00000197530 TRUE 0
## 3 ENSG00000157870 TRUE 0
## 4 ENSG00000160087 TRUE 0
## 5 ENSG00000008130 TRUE 0
## 6 ENSG00000224870 TRUE 0
We can further use the gene
function to gather all
output for a specific gene: gene level, transcript level and convergence
results.
## $gene_results
## Gene_id p.values adj.p.values p.values_inverted
## 1 ENSG00000162585 9.132076e-05 0.003287547 9.132076e-05
## adj.p.values_inverted DTU_measure Mean log-prec A Mean log-prec B
## 1 0.003287547 1.329122 3.864902 5.115116
## SD log-prec A SD log-prec B
## 1 1.402252 2.293613
##
## $transcript_results
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 3.861658e-06 0.000440229 0.7209274
## 2 ENSG00000162585 ENST00000476803 1.083969e-04 0.006178621 0.7830943
## 3 ENSG00000162585 ENST00000414253 9.183543e-01 0.999945452 0.9183543
## 4 ENSG00000162585 ENST00000497675 9.773652e-01 0.999945452 0.9773652
## Max_Gene_Tr.Adj.p.val Mean A Mean B SD A SD B
## 1 0.003287547 0.72092739 0.04278565 0.12135398 0.09376301
## 2 0.006178621 0.13211392 0.78309429 0.11263834 0.14019984
## 3 0.999945452 0.10138194 0.11783651 0.04353412 0.06672338
## 4 0.999945452 0.04557675 0.05628355 0.02149286 0.04648564
##
## $convergence_results
## Gene_id converged burn_in
## 1 ENSG00000162585 TRUE 0
Similarly we can use the transcript
function to gather
all output for a specific transcript.
## $transcript_results
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 3.861658e-06 0.000440229 0.7209274
## Max_Gene_Tr.Adj.p.val Mean A Mean B SD A SD B
## 1 0.003287547 0.7209274 0.04278565 0.121354 0.09376301
##
## $gene_results
## Gene_id p.values adj.p.values p.values_inverted
## 1 ENSG00000162585 9.132076e-05 0.003287547 9.132076e-05
## adj.p.values_inverted DTU_measure Mean log-prec A Mean log-prec B
## 1 0.003287547 1.329122 3.864902 5.115116
## SD log-prec A SD log-prec B
## 1 1.402252 2.293613
##
## $convergence_results
## Gene_id converged burn_in
## 1 ENSG00000162585 TRUE 0
Finally, we can plot the estimated average transcript relative
expression in the two groups for a specific gene via
plot_proportions
. When CI = TRUE
(default), a
solid black line is plotted on top of the histograms, indicating the
profile Wald type confidence interval (CI) of each transcript relative
expression; the level of the CI can be set via CI_level
parameter (0.95 by default). Note that the width of the CIs is a
consequence of the limited ammount of available data (i.e., few counts);
the boundaries are usually much smaller in real datasets.
In this Section we aim to explain in detail the output of
test_DTU
.
In both, gene- and transcript-level tests, p.values
and
adj.p.values
indicate the p.values and adjusted p.values,
where adjusted p.values are obtained via p.adjust
, by
implementing Benjamini and Hochberg correction.
In gene level results, only for two-group comparisons, we also
propose a conservative measure, p.values_inverted
, which
accounts for the inversion of the dominant transcript (i.e., the most
expressed transcript). If the dominant transcript is the same under both
groups, $p.values\_inverted =
\sqrt{p.values}$, while if the dominant transcript varies between
the two groups, p.values_inverted = p.values.
In other words, when the dominant transcript is unchanged between
conditions, we take the square root of the p.value, which results in an
inflated value (e.g., $\sqrt{0.01} =
0.1$). This measure is based on the observation that often
differential splicing leads to a change in the dominant transcript and,
given similar p.values, it will rank higher genes with different
dominant transcritps between conditions.
We also propose a score, DTU_measure
, again only defined
for two-group comparisons, which is intended to measure the intensity of
the DTU change, similarly to fold changes in differential expression
analyses. Consider a gene with K transcripts with relative abundance
π1(A), …, πK(A),
for group A, and π1(B), …, πK(B),
for group B.
DTU_measure
is defined as the summation of the absolute
difference between the two most expressed transcripts: ∑k ∈ K̃|πk(A) − πk(B)|,
where K̃ indicates the set of
two most expressed transcripts across both groups (i.e., adding πk(A)
and πk(B)).
Note that this measure ranges between 0, when proportions are identical
between groups, and 2, when an isoform is always expressed in group A
and a different transcript is always chosen in group B.
Finally, Mean log-prec group_name
and
SD log-prec group_name
indicate the posterior mean and
standard deviation of log(δ),
i.e., the logarithm of the Dirichlet precision parameter in each group.
The precision parameter models the degree of over-dispersion between
samples: the higher the precision parameter (or its logarithm), the
lower the sample-to-sample variability.
In transcript level results, Max_Gene_Tr.p.val
and
Max_Gene_Tr.Adj.p.val
are two conservative transcript level
measures which account for both, the gene- and transcript-level
p.values: they are the maximum between the gene and transcript level
p.values and adjusted p.values, respectively. With these measures, a
transcript can only be detected as significant if the corresponding gene
is also significant.
Finally, Mean group_name
and SD group_name
indicate the posterior mean and standard deviation of each transcript
mean relative abundance.
If 3 or more groups are available inference is carried out as in the case with 2 groups described above.
Here we propose re-analyze the previous data assuming a three-group structure. The pipeline exposed above is unchanged, except for the design matrix, which now includes three groups.
samples_design_3_groups = data.frame(sample_id = sample_names,
group = c("A", "B", "B", "C"))
samples_design_3_groups
## sample_id group
## 1 sample1 A
## 2 sample2 B
## 3 sample3 B
## 4 sample4 C
## NULL
Perform differential splicing:
set.seed(61217)
results_3_groups = test_DTU(BANDITS_data = input_data,
precision = precision$prior,
samples_design = samples_design_3_groups,
group_col_name = "group",
R = 10^4, burn_in = 2*10^3, n_cores = 2,
gene_to_transcript = gene_tr_id)
## Starting the MCMC
## MCMC completed
## Returning results
## A 'BANDITS_test' object, with 36 gene and 114 transcript level results.
Below we visualize gene- and transcript-level results tables Note
that NaN
or NA
appear when no counts are
available for a specific group of samples; in such cases, the remainig
two groups of samples are compared.
## Gene_id p.values adj.p.values Mean log-prec A Mean log-prec B
## 1 ENSG00000162585 0.0001565409 0.005635472 5.307830 4.572408
## 2 ENSG00000197530 0.0119764713 0.215576483 4.486540 4.109291
## 3 ENSG00000127054 0.0227727710 0.273273252 5.410571 4.944628
## 4 ENSG00000162591 0.0644455925 0.580010333 5.719586 4.369761
## 5 ENSG00000008130 0.1266662790 0.911997209 4.318826 4.037817
## 6 ENSG00000224870 0.1973177388 0.999620960 NaN 5.016170
## Mean log-prec C SD log-prec A SD log-prec B SD log-prec C
## 1 4.022116 2.151686 1.612514 1.751569
## 2 5.744296 2.177195 1.666262 1.479708
## 3 6.709370 1.322932 1.336821 1.083596
## 4 3.895671 3.470601 2.583375 2.499606
## 5 5.076597 2.730004 2.981370 2.507951
## 6 3.895747 NA 3.195189 3.283830
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 7.646984e-05 0.008717562 0.7070509
## 2 ENSG00000162585 ENST00000476803 1.880597e-03 0.071462679 0.7385681
## 3 ENSG00000162585 ENST00000414253 3.327616e-01 0.997039172 0.3327616
## 4 ENSG00000162585 ENST00000497675 8.904123e-01 0.997039172 0.8904123
## 5 ENSG00000197530 ENST00000487053 2.241618e-02 0.638861230 0.2998206
## 6 ENSG00000197530 ENST00000514234 3.105753e-02 0.708111614 0.5408052
## Max_Gene_Tr.Adj.p.val Mean A Mean B Mean C SD A SD B
## 1 0.008717562 0.69564755 0.70705089 0.055926247 0.14381546 0.15853735
## 2 0.071462679 0.05365946 0.20204923 0.738568148 0.10198989 0.14900018
## 3 0.997039172 0.18681612 0.05376669 0.137455290 0.07617196 0.04093619
## 4 0.997039172 0.06387687 0.03713319 0.068050315 0.04612854 0.02042690
## 5 0.638861230 0.24743125 0.29982063 0.007190108 0.10656616 0.15745540
## 6 0.708111614 0.09004714 0.34861448 0.540805194 0.09533680 0.14327694
## SD C
## 1 0.11958157
## 2 0.18138274
## 3 0.10610313
## 4 0.05271668
## 5 0.02103195
## 6 0.15765804
We can visualize results from specific gene or transcript.
## $gene_results
## Gene_id p.values adj.p.values Mean log-prec A Mean log-prec B
## 1 ENSG00000162585 0.0001565409 0.005635472 5.30783 4.572408
## Mean log-prec C SD log-prec A SD log-prec B SD log-prec C
## 1 4.022116 2.151686 1.612514 1.751569
##
## $transcript_results
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 7.646984e-05 0.008717562 0.7070509
## 2 ENSG00000162585 ENST00000476803 1.880597e-03 0.071462679 0.7385681
## 3 ENSG00000162585 ENST00000414253 3.327616e-01 0.997039172 0.3327616
## 4 ENSG00000162585 ENST00000497675 8.904123e-01 0.997039172 0.8904123
## Max_Gene_Tr.Adj.p.val Mean A Mean B Mean C SD A SD B
## 1 0.008717562 0.69564755 0.70705089 0.05592625 0.14381546 0.15853735
## 2 0.071462679 0.05365946 0.20204923 0.73856815 0.10198989 0.14900018
## 3 0.997039172 0.18681612 0.05376669 0.13745529 0.07617196 0.04093619
## 4 0.997039172 0.06387687 0.03713319 0.06805031 0.04612854 0.02042690
## SD C
## 1 0.11958157
## 2 0.18138274
## 3 0.10610313
## 4 0.05271668
##
## $convergence_results
## Gene_id converged burn_in
## 12 ENSG00000162585 TRUE 0
## $transcript_results
## Gene_id Transcript_id p.values adj.p.values Max_Gene_Tr.p.val
## 1 ENSG00000162585 ENST00000378546 7.646984e-05 0.008717562 0.7070509
## Max_Gene_Tr.Adj.p.val Mean A Mean B Mean C SD A SD B
## 1 0.008717562 0.6956476 0.7070509 0.05592625 0.1438155 0.1585374
## SD C
## 1 0.1195816
##
## $gene_results
## Gene_id p.values adj.p.values Mean log-prec A Mean log-prec B
## 1 ENSG00000162585 0.0001565409 0.005635472 5.30783 4.572408
## Mean log-prec C SD log-prec A SD log-prec B SD log-prec C
## 1 4.022116 2.151686 1.612514 1.751569
##
## $convergence_results
## Gene_id converged burn_in
## 12 ENSG00000162585 TRUE 0
Finally, we can plots the estimated mean transcript relative expression of a specific gene.
If all samples belong to the same experimental condition, differential testing between conditions cannot be performed. Nonetheless, BANDITS can still be used to infer group-level parameters (i.e., mean relative abundance of transcripts and dispersion).
The pipeline is identical to the case exposed above: the only difference concerns the design matrix, which now includes a single group for all samples.
samples_design_1_group = data.frame(sample_id = sample_names,
group = c("A", "A", "A", "A"))
samples_design_1_group
## sample_id group
## 1 sample1 A
## 2 sample2 A
## 3 sample3 A
## 4 sample4 A
## NULL
Inference is again performed via test_DTU
function (even
though the differential testing itself is not implemented).
set.seed(61217)
results_1_group = test_DTU(BANDITS_data = input_data,
precision = precision$prior,
samples_design = samples_design_1_group,
group_col_name = "group",
R = 10^4, burn_in = 2*10^3, n_cores = 2,
gene_to_transcript = gene_tr_id)
## One group only is present in 'samples_design$group':
## BANDITS will infer model parameters, but will not test for DTU between groups.
## Starting the MCMC
## MCMC completed
## Returning results
## A 'BANDITS_test' object, with 40 gene and 124 transcript level results.
Gene- and transcript-level results can be visualized as above; results are sorted by gene name. Note that now all columns relative to DTU testing are missing (p.values, adjusted p.values, DTU_measure, etc…).
## Gene_id Mean log-prec A SD log-prec A
## 1 ENSG00000008130 3.886154 2.4905563
## 2 ENSG00000069424 5.219230 3.0897909
## 3 ENSG00000078369 4.161549 2.2753614
## 4 ENSG00000078808 8.093806 0.4912126
## 5 ENSG00000107404 10.980998 3.7645846
## 6 ENSG00000116198 4.331888 3.1714630
## Gene_id Transcript_id Mean A SD A
## 1 ENSG00000008130 ENST00000341991 0.6104389 0.15831450
## 2 ENSG00000008130 ENST00000341426 0.3895611 0.15831450
## 3 ENSG00000069424 ENST00000378097 0.9551271 0.07772535
## 4 ENSG00000069424 ENST00000378083 0.0448729 0.07772535
## 5 ENSG00000078369 ENST00000378609 0.8892735 0.09303573
## 6 ENSG00000078369 ENST00000610897 0.1107265 0.09303573
We can focus on the results of a specific gene or transcript.
## $gene_results
## Gene_id Mean log-prec A SD log-prec A
## 1 ENSG00000008130 3.886154 2.490556
##
## $transcript_results
## Gene_id Transcript_id Mean A SD A
## 1 ENSG00000008130 ENST00000341991 0.6104389 0.1583145
## 2 ENSG00000008130 ENST00000341426 0.3895611 0.1583145
##
## $convergence_results
## Gene_id converged burn_in
## 3 ENSG00000008130 TRUE 0
## $transcript_results
## Gene_id Transcript_id Mean A SD A
## 1 ENSG00000008130 ENST00000341991 0.6104389 0.1583145
##
## $gene_results
## Gene_id Mean log-prec A SD log-prec A
## 1 ENSG00000008130 3.886154 2.490556
##
## $convergence_results
## Gene_id converged burn_in
## 3 ENSG00000008130 TRUE 0
Finally, we can plots the estimated mean transcript relative expression of a specific gene.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tximport_1.33.0 BANDITS_1.23.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0
## [4] ggplot2_3.5.1 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.1 stats4_4.4.1 parallel_4.4.1
## [10] tibble_3.2.1 DRIMSeq_1.33.0 fansi_1.0.6
## [13] highr_0.11 pkgconfig_2.0.3 data.table_1.16.2
## [16] S4Vectors_0.43.2 rngtools_1.5.2 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.13 farver_2.1.2 compiler_4.4.1
## [22] stringr_1.5.1 statmod_1.5.0 munsell_0.5.1
## [25] codetools_0.2-20 GenomeInfoDb_1.41.2 htmltools_0.5.8.1
## [28] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [31] yaml_2.3.10 pillar_1.9.0 jquerylib_0.1.4
## [34] MASS_7.3-61 BiocParallel_1.39.0 cachem_1.1.0
## [37] limma_3.61.12 doRNG_1.8.6 iterators_1.0.14
## [40] foreach_1.5.2 locfit_1.5-9.10 digest_0.6.37
## [43] stringi_1.8.4 reshape2_1.4.4 labeling_0.4.3
## [46] maketools_1.3.1 fastmap_1.2.0 grid_4.4.1
## [49] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [52] utf8_1.2.4 withr_3.0.2 edgeR_4.3.21
## [55] scales_1.3.0 UCSC.utils_1.1.0 rmarkdown_2.28
## [58] XVector_0.45.0 httr_1.4.7 evaluate_1.0.1
## [61] knitr_1.48 GenomicRanges_1.57.2 IRanges_2.39.2
## [64] doParallel_1.0.17 rlang_1.1.4 Rcpp_1.0.13
## [67] glue_1.8.0 BiocManager_1.30.25 BiocGenerics_0.51.3
## [70] jsonlite_1.8.9 R6_2.5.1 plyr_1.8.9
## [73] zlibbioc_1.51.2