The regsplice
package implements statistical methods for
the detection of differential exon usage (differential splicing) in RNA
sequencing (RNA-seq) and exon microarray data sets.
The regsplice
methods are based on the use of the lasso
(L1-regularization) to improve the power of standard generalized linear
models. A key advantage of regsplice
is that runtimes are
fast compared to other leading approaches. We anticipate that similar
regularization-based methods may also have applications in other
settings.
The detailed statistical methodology and performance comparisons with other methods will be described in an upcoming paper.
This vignette demonstrates an example workflow for the
regsplice
package, using a small simulated RNA-seq data
set.
There are two options for running regsplice
: you can run
a complete workflow in one step using the wrapper function
regsplice()
; or you can run the individual functions for
each step in sequence, which provides additional flexibility and insight
into the methodology. Both options are demonstrated below.
The data set used for the example workflow consists of exon-level read counts for a subset of 100 genes from a simulated human RNA-seq data set, consisting of 6 biological samples, with 3 samples in each of 2 conditions.
The original data set is from the paper:
Soneson, Matthes et al. (2016), Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage, Genome Biology, available here
Original data files from this paper, containing the simulated RNA-seq reads (FASTQ and BAM files), are available from ArrayExpress at accession code E-MTAB-3766.
Exon bin counts were generated with the Python counting scripts provided with the DEXSeq package, using the option to exclude exons from overlapping genes instead of aggregating them into multi-gene complexes (see Soneson et al. 2016, Supplementary Material).
For this example workflow, we have selected a subset consisting of
the first 100 genes from this simulated data set. The exon-level read
counts and the true differential splicing status labels for these 100
genes are saved in the text files vignette_counts.txt
and
vignette_truth.txt
in the extdata/
directory
in the regsplice
package source code.
The regsplice
methods are designed to work with both
RNA-seq read counts and exon microarray intensities.
If you are using exon microarray data, the main steps in the workflow are the same as shown below for RNA-seq data. However, the following adjustments to the workflow are required:
Instead of RNA-seq read counts, a matrix or data frame of exon
microarray intensities is provided to the counts
input
argument. The name of the argument is still counts
,
regardless of the input data type.
Exon microarray intensities should be log2-transformed
externally, before they are provided to regsplice
. This is
usually done during pre-processing of microarray data, and may be done
automatically depending on your software.
Filtering of zero-count and low-count exon bins should be
disabled, by setting the arguments filter_zeros = FALSE
and
filter_low_counts = FALSE
.
Calculation of normalization factors should be disabled, by
setting normalize = FALSE
.
Calculation of limma-voom
transformation and weights
should be disabled, by setting voom = FALSE
.
Load the vignette example data file, which contains simulated RNA-seq
read counts for 100 genes across 6 biological samples. From the raw
data, extract the table of counts (counts
), gene IDs
(gene_IDs
), and number of exon bins per gene
(n_exons
).
Then create the condition
vector, which specifies the
experimental conditions or treatment groups for each biological
sample.
# load data
file_counts <- system.file("extdata/vignette_counts.txt", package = "regsplice")
data <- read.table(file_counts, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
head(data)
## exon sample1 sample2 sample3 sample4 sample5 sample6
## 1 ENSG00000000003:001 576 506 526 643 482 826
## 2 ENSG00000000003:002 141 122 126 157 121 191
## 3 ENSG00000000003:003 123 102 106 133 99 156
## 4 ENSG00000000003:004 86 76 77 98 72 112
## 5 ENSG00000000003:005 97 83 87 113 76 126
## 6 ENSG00000000003:006 133 107 116 155 97 170
# extract counts, gene_IDs, and n_exons
counts <- data[, 2:7]
tbl_exons <- table(sapply(strsplit(data$exon, ":"), function(s) s[[1]]))
gene_IDs <- names(tbl_exons)
n_exons <- unname(tbl_exons)
dim(counts)
## [1] 3191 6
length(gene_IDs)
## [1] 100
head(gene_IDs)
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
length(n_exons)
## [1] 100
sum(n_exons)
## [1] 3191
# create condition vector
condition <- rep(c("untreated", "treated"), each = 3)
condition
## [1] "untreated" "untreated" "untreated" "treated" "treated" "treated"
The regsplice()
wrapper function runs the analysis
pipeline with a single command. The required input format for the
wrapper function is a RegspliceData
object, which is
created with the RegspliceData()
constructor function.
The results of a regsplice
analysis consist of a set of
multiple testing adjusted p-values (Benjamini-Hochberg false discovery
rates, FDR) quantifying the statistical evidence for differential exon
usage (DEU) for each gene. The adjusted p-values are used to rank the
genes in the data set according to their evidence for DEU, and a
significance threshold can be specified to generate a list of genes with
statistically significant evidence for DEU.
The wrapper function returns gene names, fitted model results, raw p-values, multiple testing adjusted p-values, likelihood ratio (LR) test statistics, and degrees of freedom of the LR tests.
The required inputs for the RegspliceData()
constructor
function are counts
(matrix or data frame of RNA-seq read
counts or exon microarray intensities), gene_IDs
(vector of
gene IDs), n_exons
(vector of exon lengths, i.e. number of
exon bins per gene), and condition
(vector of experimental
conditions for each biological sample).
Alternatively, the inputs can be provided as a
SummarizedExperiment
object, which will be parsed to
extract these components. This may be useful when running
regsplice
as part of a pipeline with other packages.
Note that we have used suppressWarnings()
to hide
warning messages related to the small number of observations per gene in
this example data set. For most data sets, these warning messages should
not be present.
See ?RegspliceData
and ?regsplice
for
additional details, including other available inputs and options. The
seed
argument is used to generate reproducible results.
library(regsplice)
rs_data <- RegspliceData(counts, gene_IDs, n_exons, condition)
rs_results <- suppressWarnings(regsplice(rs_data, seed = 123))
## removed 936 exon(s) with zero counts
## removed 1 remaining single-exon gene(s)
## removed 240 low-count exon(s)
## removed 4 remaining single-exon gene(s)
## Fitting regularized (lasso) models...
## Fitting null models...
The function summaryTable()
is used to generate a
summary table of the results.
The results are displayed as a data frame of the top n
most highly significant genes, ranked according to either the false
discovery rate (FDR) or raw p-values, up to a specified significance
threshold (e.g. FDR < 0.05).
The argument rank_by
chooses whether to rank by FDR or
raw p-values.
To display results for all genes up to the significance threshold,
set the argument n = Inf
. To display results for all genes
in the data set, set both n = Inf
and
threshold = 1
.
For more details, see ?summaryTable
.
Alternatively, the regsplice
analysis pipeline can be
run using the individual functions for each step, which provides
additional flexibility and insight into the statistical methodology. The
steps are described below.
The first step is to create a RegspliceData
object,
which contains data in the format required by the functions in the
regsplice
analysis pipeline.
The RegspliceData
format is based on the
SummarizedExperiment
container from Bioconductor. The main
advantage of this format is that subsetting operations automatically
keep data and meta-data for rows and columns in sync, which helps avoid
errors caused by selecting incorrect row or column indices.
Initially, the RegspliceData
objects contain raw data
along with meta-data for rows (genes and exon bins) and columns
(biological samples). During subsequent steps in the
regsplice
analysis pipeline, the data values are modified,
and additional data and meta-data are added. Final results are stored in
a RegspliceResults
object.
The required inputs are counts
(matrix or data frame of
RNA-seq read counts or exon microarray intensities),
gene_IDs
(vector of gene IDs), n_exons
(vector
of exon lengths, i.e. number of exon bins per gene), and
condition
(vector of experimental conditions for each
biological sample).
Alternatively, the inputs can be provided as a
SummarizedExperiment
object, which will be parsed to
extract these components. This may be useful when running
regsplice
as part of a pipeline with other packages.
Note that the warning messages are due to the small number of observations per gene in the data set used in this vignette. In most data sets, these warning messages should not be present.
For more details, see ?RegspliceData
.
Next, use the function filterZeros()
to filter exon bins
(rows) with zero counts in all biological samples (columns).
Any remaining single-exon genes are also removed (since differential splicing requires multiple exons).
If you are using exon microarray data, this step should be skipped.
For more details, see ?filterZeros
.
Filter low-count exon bins with filterLowCounts()
.
The arguments filter_min_per_exon
and
filter_min_per_sample
control the amount of filtering.
Default values are provided; however, these should be adjusted depending
on the total number of samples and the number of samples per
condition.
Any remaining single-exon genes are also removed.
If you are using exon microarray data, this step should be skipped.
For more details, see ?filterLowCounts
.
The function runNormalization()
calculates normalization
factors, which are used to scale library sizes.
By default, runNormalization()
uses the TMM (trimmed
mean of M-values) normalization method (Robinson and Oshlack, 2010),
implemented in the edgeR
package. For more details, see the
documentation for calcNormFactors()
in the
edgeR
package.
This step should be done after filtering. The normalization factors
are then used by limma-voom
in the next step.
If you are using exon microarray data, this step should be skipped.
For more details, see ?runNormalization
.
The next step is to use limma-voom
to transform the
counts and calculate exon-level weights. This is done with the
runVoom()
function.
The limma-voom
methodology transforms counts to
log2-counts per million (logCPM), and calculates exon-level weights
based on the observed mean-variance relationship. This is required
because raw or log-transformed counts do not fulfill the statistical
assumptions required for linear modeling (i.e. equal variance). After
the limma-voom
transformation and weights have been
calculated, linear modeling methods can be used.
For more details, see the following paper, which introduced
voom
; or the limma
User’s Guide (section “Differential splicing”) available on
Bioconductor.
Note that voom
assumes that exon bins (rows) with zero
or low counts have already been removed, so this step should be done
after filtering with filterZeros()
and
filterLowCounts()
.
If normalization factors are available (from previous step with
runNormalization()
), they will be used by voom
to calculate normalized library sizes. If they are not available,
voom
will use non-normalized library sizes (columnwise
total counts) instead.
If you are using exon microarray data, this step should be skipped.
For more details, see ?runVoom
.
rs_data <- runVoom(rs_data)
# view column meta-data including normalization factors and normalized library sizes
colData(rs_data)
## DataFrame with 6 rows and 4 columns
## sample_names condition norm_factors lib_sizes
## <character> <character> <numeric> <numeric>
## 1 sample1 untreated 0.872019 348678
## 2 sample2 untreated 1.003102 327180
## 3 sample3 untreated 1.082822 315879
## 4 sample4 treated 1.033876 305823
## 5 sample5 treated 0.983314 359353
## 6 sample6 treated 1.038511 300455
The initializeResults()
function creates a
RegspliceResults
object, which will contain the results of
the analysis. This object will be populated in the subsequent steps.
For more details, see ?initializeResults
.
There are three model fitting functions:
fitRegMultiple()
fits regularized (lasso) models
containing an optimal subset of exon:condition interaction terms for
each gene. The model fitting procedure penalizes the interaction terms
only, so that the main effect terms for exons and samples are always
included. This ensures that the null model is nested, allowing
likelihood ratio tests to be calculated.
fitNullMultiple()
fits the null models, which do not
contain any interaction terms.
fitFullMultiple()
fits “full” models, which contain
all exon:condition interaction terms for each gene.
The fitting functions fit models for all genes in the data set.
Note that we have used suppressWarnings()
to hide
warning messages related to the small number of observations per gene in
this example data set. For most data sets, these warning messages should
not be present.
For more details, see ?fitRegMultiple
,
?fitNullMultiple
, or ?fitFullMultiple
.
# set random seed for reproducibility
seed <- 123
# fit regularized models
rs_results <- suppressWarnings(fitRegMultiple(rs_results, rs_data, seed = seed))
## Fitting regularized (lasso) models...
# fit null models
rs_results <- fitNullMultiple(rs_results, rs_data, seed = seed)
## Fitting null models...
# fit "full" models (not required if 'when_null_selected = "ones"' in next step)
rs_results <- fitFullMultiple(rs_results, rs_data, seed = seed)
## Fitting full models...
The function LRTests()
calculates likelihood ratio (LR)
tests between the fitted models and null models.
If the fitted regularized (lasso) model contains at least one
exon:condition interaction term, the LR test compares the lasso model
against the nested null model. However, if the lasso model contains zero
interaction terms, then the lasso and null models are identical, so the
LR test cannot be calculated. The when_null_selected
argument lets the user choose what to do in these cases: either set
p-values equal to 1 (when_null_selected = "ones"
); or
calculate a LR test using the “full” model containing all exon:condition
interaction terms (when_null_selected = "full"
), which
reduces power due to the larger number of terms, but allows the evidence
for differential exon usage among these genes to be distinguished. You
can also return NAs for these genes
(when_null_selected = "NA"
).
The default option is when_null_selected = "ones"
. This
simply calls all these genes non-significant, which in most cases is
sufficient since we are more interested in genes with strong evidence
for differential exon usage. However, if it is important to rank the
low-evidence genes in your data set, use the
when_null_selected = "full"
option. If
when_null_selected = "ones"
or
when_null_selected = "NA"
, the “full” fitted models are not
required.
The results object contains gene names, fitted model results, raw p-values, multiple testing adjusted p-values (Benjamini-Hochberg false discovery rates, FDR), likelihood ratio (LR) test statistics, and degrees of freedom of the LR tests.
For more details, see ?LRTests
.
The function summaryTable()
is used to generate a
summary table of the results.
The results are displayed as a data frame of the top n
most highly significant genes, ranked according to either the false
discovery rate (FDR) or raw p-values, up to a specified significance
threshold (e.g. FDR < 0.05).
The argument rank_by
chooses whether to rank by FDR or
raw p-values.
To display results for all genes up to the significance threshold,
set the argument n = Inf
. To display results for all genes
in the data set, set both n = Inf
and
threshold = 1
.
For more details, see ?summaryTable
.
For the simulated data set in this vignette, the true differential splicing status of each gene is known. In this section, we show how to analyze the results and calculate a contingency table showing the number of true positives, true negatives, false positives, and false negatives.
As shown in the workflow above, we can use the
summaryTable()
function with argument n = Inf
to display a list of all genes with significant evidence for
differential exon usage (DEU).
summaryTable(rs_results, n = Inf)
## gene_IDs p_vals p_adj LR_stats df_tests
## 1 ENSG00000004766 1.963457e-17 6.086718e-16 137.05717 25
## 2 ENSG00000001461 8.635485e-06 1.338500e-04 26.20603 3
## 3 ENSG00000003436 1.316879e-04 1.360775e-03 17.87015 2
The total number of genes with significant evidence for DEU at a given threshold can also be calculated.
Note that we are using the multiple testing adjusted p-values (Benjamini-Hochberg false discovery rates, FDRs) for this calculation. A standard threshold of FDR < 0.05 implies that 5% of genes in the list are expected to be false discoveries.
As mentioned above, the true differential splicing (DS) status is known for each gene, since this is a simulated data set. Therefore, we can calculate contingency tables comparing the true and predicted DS status for each gene at a given significance threshold. Increasing the significance threshold returns more genes, at the expense of a larger number of false positives.
# load true DS status labels
file_truth <- system.file("extdata/vignette_truth.txt", package = "regsplice")
data_truth <- read.table(file_truth, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(data_truth)
## 'data.frame': 100 obs. of 2 variables:
## $ gene : chr "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## $ ds_status: int 0 0 0 0 0 0 0 0 1 0 ...
# remove genes that were filtered during regsplice analysis
data_truth <- data_truth[data_truth$gene %in% gene_IDs(rs_results), ]
dim(data_truth)
## [1] 81 2
length(gene_IDs(rs_results))
## [1] 81
# number of true DS genes in simulated data set
sum(data_truth$ds_status == 1)
## [1] 6
table(data_truth$ds_status)
##
## 0 1
## 75 6
# contingency table comparing true and predicted DS status for each gene
# (significance threshold: FDR < 0.05)
table(true = data_truth$ds_status, predicted = p_adj(rs_results) < 0.05)
## predicted
## true FALSE TRUE
## 0 73 2
## 1 5 1
# increasing the threshold detects more genes, at the expense of more false positives
table(true = data_truth$ds_status, predicted = p_adj(rs_results) < 0.99)
## predicted
## true FALSE TRUE
## 0 69 6
## 1 4 2
Additional user options not discussed in the workflow above include:
alpha
: Elastic net parameter for glmnet
model fitting functions. The value of alpha
must be between
0 (ridge regression) and 1 (lasso). The default value is 1, which fits a
lasso model. See glmnet
package documentation for more
details.lambda_choice
: Parameter to select which optimal
lambda
value to choose from the cv.glmnet
cross validation fit. Available choices are "lambda.min"
(model with minimum cross-validated error) and "lambda.1se"
(most regularized model with cross-validated error within one standard
error of minimum). The default value is "lambda.min"
. See
glmnet
package documentation for more details.For further details, including a complete list and description of all
available user options, refer to the documentation for the
regsplice()
wrapper function, which can be accessed with
?regsplice
or help(regsplice)
.
The function createDesignMatrix()
creates the model
design matrix for each gene. This function is called automatically by
the model fitting functions, so does not need to be used directly. In
this section, we demonstrate how it works for a single gene, and show an
example design matrix, in order to provide further insight into the
statistical methodology.
The design matrix includes main effect terms for each exon and each sample, and interaction terms between the exons and conditions.
Note that the design matrix does not include main effect terms for the conditions, since these are absorbed into the main effect terms for the samples. In addition, the design matrix does not include an intercept column, since it is simpler to let the model fitting functions add an intercept term later.
For more details, see ?createDesignMatrix
.
# gene with 3 exons
# 4 biological samples; 2 samples in each of 2 conditions
design_example <- createDesignMatrix(condition = rep(c(0, 1), each = 2), n_exons = 3)
design_example
## Exon2 Exon3 Samp2 Samp3 Samp4 Exon2:Cond1 Exon3:Cond1
## 1 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 0 1 0 0 0 0 0
## 4 0 0 1 0 0 0 0
## 5 1 0 1 0 0 0 0
## 6 0 1 1 0 0 0 0
## 7 0 0 0 1 0 0 0
## 8 1 0 0 1 0 1 0
## 9 0 1 0 1 0 0 1
## 10 0 0 0 0 1 0 0
## 11 1 0 0 0 1 1 0
## 12 0 1 0 0 1 0 1