saseR
is a R package to perform aberrant expression and
splicing analyses in bulk RNA-seq datasets. It uses adapted offsets to
model aberrant splicing with the negative binomial framework, and
estimates the parameters on a scalable and efficient manner. In this
vignette, we first show how the different functions of the
saseR
package can be used, after which we show how to
perform differential usage
analysis with adapted offsets in
conventional bulk RNA-seq methods DESeq2 and edgeR.
All details about the saseR
model, statistical tests,
the use of adapted offsets to model proportions and the adapted
parameter estimation are described in our preprint (Segers et al. 2023).
In this vignette, we analyse the data from the ASpli
package (Mancini et al. 2021), to show how
a workflow for aberrant expression and splicing is done. However, we
note that this dataset is not specific for aberrant expression or
splicing analyses. The data consists of BAM-files which will be
converted into gene, bin,and junction reads, on which three different
saseR
analysis will be performed (aberrant expression and
two times an aberrant splicing analysis). Next, we show how a
differential usage/splicing analysis can be done using DESeq2 (Love, Huber, and Anders 2014) and edgeR (Robinson, McCarthy, and Smyth 2009) with
adapted offsets.
The package has two different categories of functions, with respectively two and three functions.
First, BAM-files need to be transformed to gene, bin or junctions counts. This is done by:
BamtoAspliCounts
, which transforms BAM-files to
different counts using adapted functions of the ASpli
package. The different counts are given in a
SummarizedExperiment
object in the metadata slots.
convertAspli
selects the relevant counts (gene, bins
or junctions) to be further used in the analyses.
Then, the chosen counts can be further used to perform an aberrant expression or splicing analysis with the following functions:
calculateOffsets
, which calculates the offset matrix
used in the analysis. These offsets can be ordinary edgeR
or DESeq2
offsets when performing an aberrant expression
analysis, or can be the total bin or junction counts aggregated per gene
when performing an aberrant splicing analysis.
saseRfindEncodingDim
estimates the optimal number of
latent factors to control for unknown confounders. This number of latent
factors is then further used in the final fit.
saseRfit
calculates the mean expression and
feature-specific dispersions for each feature in each sample,
controlling for both known and unknown confounders. It calculates
p-values which represent how extreme each observation is compared to its
estimated distribution.
Note that one can do a saseR analysis by using readily available count matrices, therefore not requiring to start from BAM-files.
saseR
can be installed from Bioconductor with:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("saseR")
Alternatively, the development version of saseR
can be
downloaded with:
The following data corresponds to the vignette data from the
ASpli
package (Mancini et al.
2021).
The input data required for saseR are BAM-files with corresponding BAI files and a genome annotation in TxDb format. These BAM-files will be transformed to gene/bin/junction count matrices. When a count matrix is readily available, one can just use this to perform an aberrant expression or splicing analysis (see section 5 and 6) .
First, we specify the path to our GTF file, which we will later transform to the TxDb format. Also, we include the pathnames of the BAM-files we want to include in the analysis.
## [1] "/github/workspace/pkglib/ASpli/extdata/genes.mini.gtf"
## [1] "/github/workspace/pkglib/ASpli/extdata/A_C_0.bam"
## [2] "/github/workspace/pkglib/ASpli/extdata/A_C_1.bam"
## [3] "/github/workspace/pkglib/ASpli/extdata/A_C_2.bam"
## [4] "/github/workspace/pkglib/ASpli/extdata/A_D_0.bam"
## [5] "/github/workspace/pkglib/ASpli/extdata/A_D_1.bam"
## [6] "/github/workspace/pkglib/ASpli/extdata/A_D_2.bam"
Then, we define a targets dataframe, which contains the sample names,
BAM-file pathsand experimental factors column. For the case of rare
Mendelian disorders and outlier detection, we choose these experimental
factors to be all equal to A. Note that the experimental factors
included here will not further be used in the final analysis, but are
needed to compute the count matrices with functions from
ASpli
.
targets <- data.frame(
row.names = paste0('Sample',c(1:12)),
bam = BAMFiles,
f1 = rep("A",12),
stringsAsFactors = FALSE)
head(targets)
## bam f1
## Sample1 /github/workspace/pkglib/ASpli/extdata/A_C_0.bam A
## Sample2 /github/workspace/pkglib/ASpli/extdata/A_C_1.bam A
## Sample3 /github/workspace/pkglib/ASpli/extdata/A_C_2.bam A
## Sample4 /github/workspace/pkglib/ASpli/extdata/A_D_0.bam A
## Sample5 /github/workspace/pkglib/ASpli/extdata/A_D_1.bam A
## Sample6 /github/workspace/pkglib/ASpli/extdata/A_D_2.bam A
Next, we transform our GTF file to TxDb format, and extract the
features from the annotation using the binGenome
function
from saseR
, which is an adapted copy from
ASpli
as this function is deprecated. Note that the
binGenome function can take a while.
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
## * Number of extracted Genes = 10
## * Number of extracted Genes = 10
## * Number of extracted Exon Bins = 38
## * Number of extracted Exon Bins = 38
## * Number of extracted intron bins = 34
## * Number of extracted intron bins = 34
## * Number of extracted trascripts = 25
## * Number of extracted trascripts = 25
## * Number of extracted junctions = 23
## * Number of extracted junctions = 23
## * Number of AS bins (not include external) = 17
## * Number of AS bins (include external) = 17
## * Classified as:
## ES bins = 1 (6%)
## IR bins = 3 (18%)
## Alt5'ss bins = 2 (12%)
## Alt3'ss bins = 2 (12%)
## Multiple AS bins = 9 (53%)
## classified as:
## ES bins = 1 (11%)
## IR bins = 4 (44%)
## Alt5'ss bins = 2 (22%)
## Alt3'ss bins = 1 (11%)
## * Number of AS bins (not include external) = 17
## * Number of AS bins (include external) = 17
## * Classified as:
## ES bins = 1 (6%)
## IR bins = 3 (18%)
## Alt5'ss bins = 2 (12%)
## Alt3'ss bins = 2 (12%)
## Multiple AS bins = 9 (53%)
## classified as:
## ES bins = 1 (11%)
## IR bins = 4 (44%)
## Alt5'ss bins = 2 (22%)
## Alt3'ss bins = 1 (11%)
## Correcting Io ends, this might take a while...
## Genome binning completed
Now that we have the targets
data.frame containing the
samples and BAM-file names, and the features
extracted from
the genome annotation, we can obtain the count matrices that will be
used to detect aberrant expression. This can done with
BamtoAspliCounts
. Note that the counting procedure
(featurecounts) can take a while. If only one type of analysis is
preferred (e.g. only aberrant expression analysis on the gene counts),
one can do the feature counting with a different method, and start from
a count matrix at section 5 and 6. Make sure to select the right type of
libType
(SE
for single-end reads,
PE
for paired-end reads), and the
minReadLength
, which is the minimal number of bases a read
needs to be counted.
ASpliSE <- BamtoAspliCounts(
features = features,
targets = targets,
minReadLength = 100,
libType = "SE",
BPPARAM = MulticoreParam(1L)
)
## Junctions PJU completed
## Junctions PIR completed
## Junctions IR PIR completed
## Junctions AltSS PSI completed
## Junctions ES PSI completed
ASpliSE is a SummarizedExperiment object that contains the gene, bin
and junction counts, together with different annotations in the metadata
slots. We will extract these counts using the convertASpli
function.
We will show how to perform an aberrant expression analysis based on the gene level counts. Note that other feature counts (e.g. bin counts) can also be used to search for aberrant expression. We start from a SummarizedExperiment that contains the feature counts of interest.
First, a design formula is included in the metadata slot
design
. In this example, we do not include known
covariates, but only introduce an intercept. If one includes known
covariates in the formula, these covariates should be included in the
colData slot.
For an aberrant expression analysis, we first retain only the genes
that have sufficiently large counts. Then, we need to calculate the
offsets for our analysis. For aberrant expression analysis, we recommend
using TMM
or geommean
for edgeR and DESeq2
normalisation respectively. This results in an offsets matrix in the
assays slot.
Note that for aberrant splicing detection, first offsets are calculated before filtering features that have low counts.
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
Having defined the design
and the offsets
of the analysis, we can continue to estimate the optimal
number of latent confounders
included in the regression
framework. This is often needed when performing aberrant expression or
splicing analysis to improve power. We obtain the latent factors by
using RUV
(Risso et al.
2014), which performs a
singular value decomposition on the deviance residuals
.
Then, we recommend using the Gavish and Donoho (Gavish and Donoho 2014) threshold for singular
values (method = GD
) to choose upon the number of latent
factors included in the final regression model. This is much faster,
while having similar performance compared to using a denoising
autoencoder (method = DAE
).
Having estimated the optimal number of latent factors to include in
the regression framework, we can fit our final model, controlling for
both known confounders (specified in metadata slot design
)
and the latent factors based on the deviance residuals. The final
regression fit can be done with edgeR (fit = edgeR
), but is
set to default to our fast parameter estimation
(fit = fast
). This fast parameter estimation uses a
overdispersed quadratic mean-variance relationship, which enables matrix
multiplication while keeping parameter estimation unbiased. This is
often required in the context of rare diseases that need many latent
factors to control for, which leads to ordinary fitting procedures such
as DESeq2 and edgeR becoming slow. If the number of confounders included
in the regression model is small (known confounders + latent factors),
edgeR parameter estimation could be used, although it is not expected to
give an increase in performance. analysis
is specified to
AE
to choose an aberrant expression analysis.
This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised genes within each sample is possible using for example:
order <- apply(assays(SEgenes)$pValue, 2, order, decreasing = FALSE)
topPvalsGenes <- sapply(1:ncol(SEgenes),
function(x) {assays(SEgenes)$pValue[order[1:5,x],x]})
rownames(topPvalsGenes) <- NULL
topGenes <- sapply(1:ncol(SEgenes),
function(x) {rownames(SEgenes)[order[1:5,x]]})
significantgenes <- sapply(1:ncol(SEgenes),
function(x) {
rownames(SEgenes)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})
head(topPvalsGenes)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.03102749 0.1159464 0.3990040 0.1286522 0.2695281 0.04821107 0.1310854
## [2,] 0.17011462 0.1335962 0.4043607 0.4014844 0.2751492 0.36674082 0.3675906
## [3,] 0.31623422 0.2436769 0.5734204 0.4410169 0.7156347 0.70575353 0.4863269
## [4,] 0.37490428 0.3439380 0.7622561 0.4970871 0.7222973 0.70865388 0.5156083
## [5,] 0.39977621 0.5405994 0.8160304 0.5674733 0.7552254 0.75059475 0.6228496
## [,8] [,9] [,10] [,11] [,12]
## [1,] 0.1115451 0.1421563 0.2208241 0.09214189 0.2371822
## [2,] 0.2951036 0.2621759 0.3283615 0.23906661 0.2664025
## [3,] 0.5051797 0.2869709 0.4140313 0.29306342 0.4882280
## [4,] 0.5307402 0.3739860 0.4719410 0.55818088 0.6595572
## [5,] 0.5677993 0.3777315 0.5070343 0.65837916 0.6978440
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] "GENE07" "GENE09" "GENE04" "GENE09" "GENE03" "GENE09" "GENE05" "GENE05"
## [2,] "GENE05" "GENE05" "GENE07" "GENE01" "GENE05" "GENE01" "GENE02" "GENE07"
## [3,] "GENE10" "GENE07" "GENE09" "GENE05" "GENE06" "GENE08" "GENE06" "GENE02"
## [4,] "GENE04" "GENE01" "GENE02" "GENE03" "GENE09" "GENE03" "GENE08" "GENE04"
## [5,] "GENE09" "GENE03" "GENE10" "GENE08" "GENE04" "GENE10" "GENE07" "GENE03"
## [,9] [,10] [,11] [,12]
## [1,] "GENE07" "GENE02" "GENE05" "GENE10"
## [2,] "GENE06" "GENE10" "GENE02" "GENE05"
## [3,] "GENE04" "GENE07" "GENE08" "GENE07"
## [4,] "GENE08" "GENE04" "GENE09" "GENE01"
## [5,] "GENE09" "GENE08" "GENE03" "GENE03"
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
We will show how to perform an aberrant splicing analysis based on the bin and junction level counts. Note that other feature counts can also be used if proportions of interest are to be modelled, which only requires the use of the correct offsets. We start from a SummarizedExperiment that contains the feature counts of interest.
First, a design formula is included in the metadata slot
design
. In this example, we do not include known
covariates, but only introduce an intercept. If one includes known
covariates in the formula, these covariates should be included in the
colData slot.
Then, we need to calculate the offsets for our analysis. For an
aberrant splicing analysis, AS
or proportion
should be used as method to calculate offsets that allow to model
proportions. This results in an offsets matrix in the assays slot. We
specificy the column in the rowData slot which is used to aggregate
feature counts to obtain proper offsets (aggregation
). This
is the locus
column for the bins analysis, and the
symbol
or ASpliCluster
column for a junctions
analysis. For the junctions analysis, it is also possible to use
aggregation based on symbol
(= gene) whenever an annotated
gene is available, and switch to ASpliCluster
when the
junction is not linked to a specific gene. This is done by setting
mergeGeneASpli
equal to TRUE. Note that a new column in the
rowData will be made, locus
, which specifies the used
aggregation.
Note that also other rowData columns can be used to aggregate feature
counts, and also a predefined offsets matrix can be included as
offsets
in the assays slot.
## Using locus column in rowData to calculate offsets.
## 0 features filtered due to NA, '-', or 'noHit' locus.
SEjunctions <- calculateOffsets(SEjunctions,
method = "AS",
aggregation = "symbol",
mergeGeneASpli = TRUE)
## Using symbol column in rowData to calculate offsets.
## 0 features filtered due to NA, '-', or 'noHit' locus.
Then, after calculating the offsets for the aberrant splicing analysis, we filter the features that have very low counts. Note that for aberrant expression analyses, first the features are filtered, followed by calculating the offsets.
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
## Warning in filterByExpr.DGEList(y, design = design, group = group, lib.size =
## lib.size, : All samples appear to belong to the same group.
Having defined the design
and the offsets
of the analysis, we can continue to estimate the optimal
number of latent confounders
included in the regression
framework. This is often needed when performing aberrant expression or
splicing analysis to obtain better power. We obtain the latent factors
by using RUV
(Risso et al.
2014), which performs a
singular value decomposition on the deviance residuals
.
Then, we recommend using the Gavish and Donoho (Gavish and Donoho 2014) threshold for singular
values (method = GD
) to choose upon the number of latent
factors included in the final regression model. This is much faster,
while having similar performance compared to using a denoising
autoencoder (method = DAE
).
SEbins <- saseRfindEncodingDim(SEbins, method = "GD")
SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD")
Having estimated the optimal number of latent factors to include in
the regression framework, we can fit our final model, controlling for
both known confounders (specified in metadata slot design
)
and the latent factors based on the deviance residuals. The final
regression fit can be done with edgeR (fit = edgeR
), but is
set to default to our fast parameter estimation
(fit = fast
). This fast parameter estimation uses a
overdispersed quadratic mean-variance relationship, which enables matrix
multiplication while keeping parameter estimation unbiased.This is often
required in the context of rare diseases that need many latent factors
to control for, which leads to ordinary fitting procedures such as edgeR
becoming slow. If the number of confounders included in the regression
model is small (known confounders + latent factors), edgeR parameter
estimation could be used, although it is not expected to give an
increase in performance. analysis
is specified on
AS
to choose an aberrant splicing analysis.
SEbins <- saseRfit(SEbins,
analysis = "AS",
padjust = "BH",
fit = "fast")
SEjunctions <- saseRfit(SEjunctions,
analysis = "AS",
padjust = "BH",
fit = "fast")
This results in estimates in mean expression, feature specific dispersions and p-values calculated for each observation in each sample. Further analysing the top prioritised features within each sample is possible using for example:
order <- apply(assays(SEbins)$pValue, 2, order, decreasing = FALSE)
topPvalsBins <- sapply(1:ncol(SEbins),
function(x) {assays(SEbins)$pValue[order[1:5,x],x]})
rownames(topPvalsBins) <- NULL
topBins <- sapply(1:ncol(SEbins),
function(x) {rownames(SEbins)[order[1:5,x]]})
significantBins <- sapply(1:ncol(SEbins),
function(x) {
rownames(SEbins)[assays(SEgenes)$pValueAdjust[,x] < 0.05]})
head(topPvalsBins)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.2920730 0.2479812 0.1773876 0.03983227 0.03029549 0.07030136 0.2820690
## [2,] 0.3452435 0.2606600 0.3482568 0.12819629 0.20612234 0.09166173 0.3182864
## [3,] 0.4411531 0.2685777 0.3999498 0.13273306 0.28950883 0.27493653 0.3284258
## [4,] 0.4670477 0.3588860 0.4597824 0.16778278 0.29595504 0.28393816 0.4128227
## [5,] 0.5075236 0.4212592 0.4953552 0.18583522 0.31699366 0.28819200 0.4463418
## [,8] [,9] [,10] [,11] [,12]
## [1,] 0.1505300 0.06543611 0.1561171 0.1020822 0.01881900
## [2,] 0.2690009 0.14449012 0.2302083 0.1439282 0.06356613
## [3,] 0.3026018 0.14901032 0.2543287 0.3428522 0.16000229
## [4,] 0.3408827 0.19215776 0.2752078 0.4655831 0.17319650
## [5,] 0.3827884 0.28273415 0.3054174 0.4674561 0.21068788
## [,1] [,2] [,3] [,4] [,5]
## [1,] "GENE08:E005" "GENE08:E005" "GENE08:E004" "GENE06:E004" "GENE08:E004"
## [2,] "GENE08:E004" "GENE08:E001" "GENE09:E004" "GENE03:E001" "GENE01:E002"
## [3,] "GENE01:E002" "GENE06:E004" "GENE08:E001" "GENE08:E004" "GENE01:E003"
## [4,] "GENE02:E002" "GENE03:E001" "GENE08:E003" "GENE09:E002" "GENE08:E002"
## [5,] "GENE05:E004" "GENE02:E002" "GENE07:E003" "GENE10:E004" "GENE03:E002"
## [,6] [,7] [,8] [,9] [,10]
## [1,] "GENE03:E001" "GENE10:E003" "GENE08:E002" "GENE07:E002" "GENE02:E002"
## [2,] "GENE03:E002" "GENE02:E001" "GENE08:E004" "GENE07:E001" "GENE02:E003"
## [3,] "GENE06:E004" "GENE06:E002" "GENE10:E004" "GENE10:E001" "GENE06:E001"
## [4,] "GENE09:E003" "GENE06:E003" "GENE07:E002" "GENE08:E002" "GENE03:E002"
## [5,] "GENE09:E001" "GENE09:E005" "GENE10:E001" "GENE05:E004" "GENE06:E003"
## [,11] [,12]
## [1,] "GENE02:E002" "GENE02:E002"
## [2,] "GENE05:E002" "GENE03:E002"
## [3,] "GENE10:E001" "GENE05:E002"
## [4,] "GENE09:E001" "GENE02:E003"
## [5,] "GENE10:E003" "GENE06:E001"
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
Furthermore, also aggregated p-values are calculated based on the
newly formed locus
column in the rowData slot. These are
currently the minimal p-value of each feature belonging to the same
aggregation group, and can be found in the metadata slot under
pValuesLocus
.
order <- apply(metadata(SEbins)$pValuesLocus, 2, order, decreasing = FALSE)
topPvalsBinsAggregated <- sapply(1:ncol(SEbins),
function(x) {metadata(SEbins)$pValuesLocus[order[1:5,x],x]})
rownames(topPvalsBinsAggregated) <- NULL
topBinsAggregated <- sapply(1:ncol(SEbins),
function(x) {
rownames(metadata(SEbins)$pValuesLocus)[order[1:5,x]]})
significantBinsAggregated <- sapply(1:ncol(SEbins),
function(x) {
rownames(SEbins)[metadata(SEgenes)$pValuesLocus[,x] < 0.05]})
head(topPvalsBinsAggregated)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.2920730 0.2479812 0.1773876 0.03983227 0.03029549 0.07030136 0.2820690
## [2,] 0.4411531 0.2685777 0.3482568 0.12819629 0.20612234 0.27493653 0.3182864
## [3,] 0.4670477 0.3588860 0.4953552 0.13273306 0.31699366 0.28393816 0.3284258
## [4,] 0.5075236 0.4212592 0.5428406 0.16778278 0.32869031 0.39572189 0.4463418
## [5,] 0.5397520 0.4431609 0.5934711 0.18583522 0.34831266 0.47232361 0.4636188
## [,8] [,9] [,10] [,11] [,12]
## [1,] 0.1505300 0.06543611 0.1561171 0.1020822 0.01881900
## [2,] 0.3026018 0.14901032 0.2543287 0.1439282 0.06356613
## [3,] 0.3408827 0.19215776 0.2752078 0.3428522 0.16000229
## [4,] 0.4037764 0.28273415 0.4817044 0.4655831 0.21068788
## [5,] 0.4178444 0.29755667 0.5166177 0.4705838 0.43419989
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] "GENE08" "GENE08" "GENE08" "GENE06" "GENE08" "GENE03" "GENE10" "GENE08"
## [2,] "GENE01" "GENE06" "GENE09" "GENE03" "GENE01" "GENE06" "GENE02" "GENE10"
## [3,] "GENE02" "GENE03" "GENE07" "GENE08" "GENE03" "GENE09" "GENE06" "GENE07"
## [4,] "GENE05" "GENE02" "GENE10" "GENE09" "GENE10" "GENE08" "GENE09" "GENE02"
## [5,] "GENE09" "GENE01" "GENE02" "GENE10" "GENE06" "GENE02" "GENE04" "GENE01"
## [,9] [,10] [,11] [,12]
## [1,] "GENE07" "GENE02" "GENE02" "GENE02"
## [2,] "GENE10" "GENE06" "GENE05" "GENE03"
## [3,] "GENE08" "GENE03" "GENE10" "GENE05"
## [4,] "GENE05" "GENE05" "GENE09" "GENE06"
## [5,] "GENE06" "GENE08" "GENE08" "GENE08"
## [[1]]
## character(0)
##
## [[2]]
## character(0)
##
## [[3]]
## character(0)
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
Here, we show how to perform a differential usage/splicing analysis, using adapted offsets in conventual bulk RNA-seq tools DESeq2/edgeR. This way, we avoid modelling both the counts and the other counts, which means that no sample-specific regression coefficients are estimated. Note that these steps are similar compared to a conventual DESeq2/edgeR analysis, where different offsets or normalisation is used. We refer to the DESeq2/edgeR vignettes for more information upon their package and functions. Also, we refer to the DEXSeq (Anders, Reyes, and Huber 2012) for more information upon the preprocessing of the annotation.
We use the same example dataset as in the aberrant analyses, and are loaded from the ASpli package. This includes an annotation file and BAM-files.
Next, we extract the exonic parts from the annotation file using the
GenomicFeatures
package.
## Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...): makeTxDbFromGFF() has moved from GenomicFeatures to the txdbmaker package,
## and is formally deprecated in GenomicFeatures >= 1.59.1. Please call
## txdbmaker::makeTxDbFromGFF() to get rid of this warning.
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
Using summarizeOverlaps
, we count the number of overlaps
with each part of exonic parts of the annotation file. This way, we
obtain a SummarizedExperiment object with read counts for each bin.
These bin counts can then be further used to model differential
usage/splicing. Make sure to use the correct settings for
singleEnd
and ignore.strand
.
Now that we have a count matrix, we can further specify the known covariates, whichin this case is the condition between which we will test differential bin usage.
Instead of modelling differential usage/splicing by modelling both the counts and the other counts, we rather calculate the total bin-counts per gene. This will be further used as offsets in the negative binomial regression, allowing to model the proportion of bin counts over the total counts overlapping that gene. We here aggregate the read counts per gene, although the offset can be chosen flexible depending on the analysis of interest (e.g. aggregating transcript counts per gene to model differential transcript usage).
counts <- assays(se)$counts
offsetsGene <- aggregate(counts,
by = list("gene" = rowData(se)$gene_id), FUN = sum)
offsets <- offsetsGene[match(rowData(se)$gene_id, offsetsGene$gene), ] %>%
mutate("gene" = NULL) %>%
as.matrix()
Because currently it is not possible to include offsets that are equal to 0 in DESeq2/edgeR, we have to change these offsets. Here, we change the counts that correspond to an offset of 0 (these counts are also equal to 0 in a differential usage/splicing analysis) to 1, together with the corresponding offset. It is possible to change the offsets and/or counts to a number of choice, as long as the offsets are not equal to 0.
We can now perform a conventual DESeq2/edgeR analysis, using the custom offsets. For DESeq2, we include the offsets on the original scale, while for edgeR, we include these on the log-scale. For the DESeq2 analysis we can use the following code:
dds <- DESeqDataSetFromMatrix(countData = counts,
colData = colData(se),
rowData = rowData(se),
design= ~ condition)
## converting counts to integer mode
## it appears that the last variable in the design formula, 'condition',
## has a factor level, 'control', which is not the reference level. we recommend
## to use factor(...,levels=...) or relevel() to set this as the reference level
## before proceeding. for more information, please see the 'Note on factor levels'
## in vignette('DESeq2').
normalizationFactors(dds) <- offsets
dds <- DESeq2::estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)$dispGeneEst
dds <- DESeq2::nbinomWaldTest(dds)
results_dds <- DESeq2::results(dds)
While for the edgeR analysis we can use the following code:
DGE <- DGEList(counts = counts)
DGE$offset <- log(offsets)
design <- model.matrix(~condition, data = colData(se))
DGE <- edgeR::estimateDisp(DGE, design = design)
fitDGE <- edgeR::glmFit(DGE, design= design)
results_DGE <- edgeR::glmLRT(fitDGE, coef = 2)
These results now have an interpretation in terms of proportions, with the interpretation dependent on the offsets used.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicAlignments_1.43.0 Rsamtools_2.23.0
## [3] Biostrings_2.75.1 XVector_0.47.0
## [5] DEXSeq_1.53.0 RColorBrewer_1.1-3
## [7] igraph_2.1.1 data.table_1.16.2
## [9] BiocParallel_1.41.0 PRROC_1.3.1
## [11] pracma_2.4.4 dplyr_1.1.4
## [13] GenomicFeatures_1.59.1 DESeq2_1.47.1
## [15] SummarizedExperiment_1.37.0 MatrixGenerics_1.19.0
## [17] matrixStats_1.4.1 GenomicRanges_1.59.1
## [19] GenomeInfoDb_1.43.1 ASpli_2.17.0
## [21] AnnotationDbi_1.69.0 IRanges_2.41.1
## [23] S4Vectors_0.45.2 Biobase_2.67.0
## [25] BiocGenerics_0.53.3 generics_0.1.3
## [27] edgeR_4.5.0 limma_3.63.2
## [29] saseR_1.3.0 knitr_1.49
## [31] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 BiocIO_1.17.0 bitops_1.0-9
## [4] filelock_1.0.3 tibble_3.2.1 XML_3.99-0.17
## [7] rpart_4.1.23 lifecycle_1.0.4 httr2_1.0.6
## [10] pbmcapply_1.5.1 lattice_0.22-6 ensembldb_2.31.0
## [13] MASS_7.3-61 backports_1.5.0 magrittr_2.0.3
## [16] Hmisc_5.2-0 sass_0.4.9 rmarkdown_2.29
## [19] jquerylib_0.1.4 yaml_2.3.10 Gviz_1.51.0
## [22] DBI_1.2.3 buildtools_1.0.0 abind_1.4-8
## [25] zlibbioc_1.52.0 purrr_1.0.2 AnnotationFilter_1.31.0
## [28] biovizBase_1.55.0 RCurl_1.98-1.16 nnet_7.3-19
## [31] VariantAnnotation_1.53.0 rappdirs_0.3.3 GenomeInfoDbData_1.2.13
## [34] maketools_1.3.1 genefilter_1.89.0 annotate_1.85.0
## [37] codetools_0.2-20 DelayedArray_0.33.2 DT_0.33
## [40] xml2_1.3.6 tidyselect_1.2.1 precrec_0.14.4
## [43] UCSC.utils_1.3.0 BiocFileCache_2.15.0 base64enc_0.1-3
## [46] jsonlite_1.8.9 Formula_1.2-5 survival_3.7-0
## [49] tools_4.4.2 progress_1.2.3 Rcpp_1.0.13-1
## [52] glue_1.8.0 gridExtra_2.3 SparseArray_1.7.2
## [55] xfun_0.49 withr_3.0.2 BiocManager_1.30.25
## [58] fastmap_1.2.0 latticeExtra_0.6-30 fansi_1.0.6
## [61] digest_0.6.37 R6_2.5.1 colorspace_2.1-1
## [64] jpeg_0.1-10 dichromat_2.0-0.1 biomaRt_2.63.0
## [67] RSQLite_2.3.8 UpSetR_1.4.0 utf8_1.2.4
## [70] tidyr_1.3.1 robustbase_0.99-4-1 rtracklayer_1.67.0
## [73] prettyunits_1.2.0 httr_1.4.7 htmlwidgets_1.6.4
## [76] S4Arrays_1.7.1 pkgconfig_2.0.3 gtable_0.3.6
## [79] blob_1.2.4 hwriter_1.3.2.1 sys_3.4.3
## [82] pcaPP_2.0-5 htmltools_0.5.8.1 geneplotter_1.85.0
## [85] ProtGenerics_1.39.0 scales_1.3.0 png_0.1-8
## [88] rstudioapi_0.17.1 rjson_0.2.23 checkmate_2.3.2
## [91] curl_6.0.1 cachem_1.1.0 stringr_1.5.1
## [94] foreign_0.8-87 restfulr_0.0.15 pillar_1.9.0
## [97] grid_4.4.2 vctrs_0.6.5 dbplyr_2.5.0
## [100] xtable_1.8-4 cluster_2.1.6 htmlTable_2.4.3
## [103] evaluate_1.0.1 mvtnorm_1.3-2 cli_3.6.3
## [106] locfit_1.5-9.10 compiler_4.4.2 rlang_1.1.4
## [109] crayon_1.5.3 rrcov_1.7-6 interp_1.1-6
## [112] plyr_1.8.9 stringi_1.8.4 deldir_2.0-4
## [115] txdbmaker_1.3.0 munsell_0.5.1 lazyeval_0.2.2
## [118] Matrix_1.7-1 BSgenome_1.75.0 hms_1.1.3
## [121] bit64_4.5.2 ggplot2_3.5.1 KEGGREST_1.47.0
## [124] statmod_1.5.0 memoise_2.0.1 bslib_0.8.0
## [127] DEoptimR_1.1-3 bit_4.5.0