Title: | Scalable Aberrant Splicing and Expression Retrieval |
---|---|
Description: | saseR is a highly performant and fast framework for aberrant expression and splicing analyses. The main functions are: \itemize{ \item \code{\link{BamtoAspliCounts}} - Process BAM files to ASpli counts \item \code{\link{convertASpli}} - Get gene, bin or junction counts from ASpli SummarizedExperiment \item \code{\link{calculateOffsets}} - Create an offsets assays for aberrant expression or splicing analysis \item \code{\link{saseRfindEncodingDim}} - Estimate the optimal number of latent factors to include when estimating the mean expression \item \code{\link{saseRfit}} - Parameter estimation of the negative binomial distribution and compute p-values for aberrant expression and splicing } For information upon how to use these functions, check out our vignette at \url{https://github.com/statOmics/saseR/blob/main/vignettes/Vignette.Rmd} and the saseR paper: Segers, A. et al. (2023). Juggling offsets unlocks RNA-seq tools for fast scalable differential usage, aberrant splicing and expression analyses. bioRxiv. \url{https://doi.org/10.1101/2023.06.29.547014}. |
Authors: | Alexandre Segers [aut, cre], Jeroen Gilis [ctb], Mattias Van Heetvelde [ctb], Elfride De Baere [ctb], Lieven Clement [ctb] |
Maintainer: | Alexandre Segers <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.3.0 |
Built: | 2024-11-19 06:31:48 UTC |
Source: | https://github.com/bioc/saseR |
SummarizedExperiment containing gene counts, bin counts and junction counts in the metadata slots, together with some rowData information. Output from BamtoAspliCounts.
data(saseRExample)
data(saseRExample)
Loads all data needed to run vignette and examples
BamtoAspliCounts is used to convert BAM files to a SummarizedExperiment that contains gene, bin and junction level counts in the metadata slot. This function is adapted from the ASpli package. More information can be found in the corresponding package and paper.
BamtoAspliCounts( features, targets, minReadLength = 100, maxISize = 50000, libType = "PE", strandMode = 0, minAnchor = 0, threshold = 5, BPPARAM = bpparam() )
BamtoAspliCounts( features, targets, minReadLength = 100, maxISize = 50000, libType = "PE", strandMode = 0, minAnchor = 0, threshold = 5, BPPARAM = bpparam() )
features |
An object of class ASpliFeatures, obtained by using the binGenome function of ASpli. |
targets |
A dataframe containing sample, bam and experimental factors columns. |
minReadLength |
Minimum read length of sequenced library. It is used for computing E1I and IE2 read summarization. Make sure this number is smaller than the maximum read length in every bam file, otherwise no E1I or IE2 will be found |
maxISize |
Maximum intron expected size. Junctions longer than this size will be dicarded. |
libType |
Defines how reads will be treated according their sequencing library type (paired (PE, default) or single end (SE)). |
strandMode |
controls the behavior of the strand getter. It indicates how the strand of a pair should be inferred from the strand of the first and last alignments in the pair. 0: strand of the pair is always *. 1: strand of the pair is strand of its first alignment. This mode should be used when the paired-end data was generated using one of the following stranded protocols: Directional Illumina (Ligation), Standard SOLiD. 2: strand of the pair is strand of its last alignment. This mode should be used when the paired-end data was generated using one of the following stranded protocols: dUTP, NSR, NNSR, Illumina stranded TruSeq PE protocol. For more information see ?strandMode |
minAnchor |
Minimum percentage of read that should be aligned to an exon-intron boundary. |
threshold |
Minimum number of reads supporting junctions. Default=5. |
BPPARAM |
object of class |
A 'SummarizedExperiment' instance, containing 'gene', 'bin' and 'junction' counts in the metadata slot.
data(saseRExample, package = "saseR") bamFileNames <- paste(rep(c("A", "B"), each = 6), rep(c("C", "D"), 2, each = 3), paste0(rep(0:2, 2), ".bam"), sep = "_") BAMFiles <- system.file("extdata", bamFileNames, package = "saseR") targets <- data.frame( row.names = paste0('Sample',c(1:12)), bam = BAMFiles, f1 = rep("A",12), stringsAsFactors = FALSE) ASpliSE <- BamtoAspliCounts( features = features, targets = targets, minReadLength = 100, libType = "SE" )
data(saseRExample, package = "saseR") bamFileNames <- paste(rep(c("A", "B"), each = 6), rep(c("C", "D"), 2, each = 3), paste0(rep(0:2, 2), ".bam"), sep = "_") BAMFiles <- system.file("extdata", bamFileNames, package = "saseR") targets <- data.frame( row.names = paste0('Sample',c(1:12)), bam = BAMFiles, f1 = rep("A",12), stringsAsFactors = FALSE) ASpliSE <- BamtoAspliCounts( features = features, targets = targets, minReadLength = 100, libType = "SE" )
calculateOffsets is used to calculate the offsets for aberrant expression or splicing analysis. it can use ordinary offsets such as 'edgeR' and 'DESeq2' offsets for aberrant expression analysis, or use the total counts aggregated per gene to perform aberrant splicing analysis. It returns a 'SummarizedExperiment' which includes an 'offsets' matrix in the assays slot.
calculateOffsets( se, method, aggregation = NULL, zeroCountOffsets = 1, zeroOffsets = 1, mergeGeneASpli = TRUE, filterna = TRUE, saveall = FALSE )
calculateOffsets( se, method, aggregation = NULL, zeroCountOffsets = 1, zeroOffsets = 1, mergeGeneASpli = TRUE, filterna = TRUE, saveall = FALSE )
se |
A 'SummarizedExperiment' instance generated with the SummarizedExperiment function of the SummarizedExperiment package. In the assay slot, provide the expression counts as an ordinary 'matrix', 'DataFrame', a 'sparseMatrix' or a 'DelayedMatrix'. When offsets are calculated for aberrant splicing ('AS') or other proportions ('proportion'), a rowData column is required which is used to group features over which the sum is taken as offset. |
method |
method used to calculate the offsets. ‘TMM' uses edgeR’s trimmed mean of M-values, ‘geommean' uses DESeq2’s size factors, 'unit' uses a unit matrix of offsets, 'AS' and 'proportion' use a rowData column to aggregate feature counts (e.g. calculating the total bin counts per gene) for aberrant splicing or other usage analyses. |
aggregation |
column of rowData used to aggregate feature counts. Only used when 'method' is 'AS' or 'proportion'. If not specified, the 'locus' column is used when available, followed by the 'symbol' column. |
zeroCountOffsets |
When using the 'AS' or 'proportion' method, it is possible that some offsets are 0. As offsets need to be a strict positive value, these offsets are changed to a strict positive number (defined by 'zeroOffsets', with a default of 1). The corresponding count can also be changed, with a default change to 1. Note that, although it is possible, it is not recommended to specify 'zeroCountOffsets' greater than 'zeroOffsets'. |
zeroOffsets |
When using the 'AS' or 'proportion' method, it is possible that some offsets are 0. As offsets need to be a strict positive value, these offsets are changed to a strict positive number (defined by 'zeroOffsets', with a default of 1). The corresponding count can also be changed, with a default change to 1. Note that, although it is possible, it is not recommended to specify 'zeroCountOffsets' greater than 'zeroOffsets'. |
mergeGeneASpli |
logical value. When ASpli is used to obtain feature counts, one can aggregate feature counts to calculate offsets based on both gene-level or ASpli-cluster. When 'mergeGeneASpli' is TRUE, one uses gene-level aggregation, except where no annotated gene is available, and use ASpli-cluster aggregation for these features. |
filterna |
logical value to filter NA locus values in rowData. These are not necessarily features without annotated gene. |
saveall |
logical value to save all aggregated counts and offsets. Default is FALSE to save memory. If TRUE, also counts and offsets of gene-level aggregation and ASpli-cluster aggregation are saved, instead of only using the final used counts and offsets. |
An updated 'SummarizedExperiment' instance, now including an 'offsets' matrix in the assays slot. When 'method' is 'AS' or 'proportion', an 'realCounts' matrix is available in the assays slot, which correspond to the original counts, while the 'counts' matrix in the assays slot is updated with to 'zeroCountOffsets' when the calculated offset is equal to 0. When 'saveall' is TRUE, also matrices 'Locuscounts', 'Locusoffsets', 'ASplicounts'and 'ASplioffsets' are saved in the assays slot, corresponding to the adapted counts and offsets using the 'locus' rowData column, and using the 'symbol' rowData column respectively.
data(saseRExample, package = "saseR") SEgenes <- calculateOffsets(SEgenes, method = "TMM") SEbins <- calculateOffsets(SEbins, method = "AS", aggregation = "locus") SEjunctions <- calculateOffsets(SEjunctions, method = "AS", aggregation = "symbol", mergeGeneASpli = TRUE)
data(saseRExample, package = "saseR") SEgenes <- calculateOffsets(SEgenes, method = "TMM") SEbins <- calculateOffsets(SEbins, method = "AS", aggregation = "locus") SEjunctions <- calculateOffsets(SEjunctions, method = "AS", aggregation = "symbol", mergeGeneASpli = TRUE)
convertASpli is used to obtain gene, bin or junction level counts from a SummarizedExperiment object containing these three in the metadata slot.
convertASpli(ASpliSE, type = "none", filter = TRUE, ...)
convertASpli(ASpliSE, type = "none", filter = TRUE, ...)
ASpliSE |
An SummarizedExperiment object obtained by the 'BamtoAspliCounts' function, which contains gene, bin and junction level counts in the metadata slot. |
type |
character vector specifying the counts to be extracted. Can be 'gene', 'bin' and 'junction'. |
filter |
logical value specifying to filter bin counts based on the type of bin. Regions that contain both intronic and exonic regions are filtered as the intron and exon bins are already defined in other features. |
... |
extra parameters for filtering. |
A 'SummarizedExperiment' instance, representing the 'gene', 'bin' or 'junction' counts as specified.
data(saseRExample, package = "saseR") SEgenes <- convertASpli(ASpliSE, type = "gene") SEbins <- convertASpli(ASpliSE, type = "bin") SEjunctions <- convertASpli(ASpliSE, type = "junction")
data(saseRExample, package = "saseR") SEgenes <- convertASpli(ASpliSE, type = "gene") SEbins <- convertASpli(ASpliSE, type = "bin") SEjunctions <- convertASpli(ASpliSE, type = "junction")
features used to count bam-files. These are obtained by binGenome from ASpli.
data(saseRExample)
data(saseRExample)
/
Loads all data needed to run vignette and examples
https://doi.org/10.1093/bioinformatics/btab141
Estefania Mancini, Andres Rabinovich, Javier Iserte, Marcelo Yanovsky, Ariel Chernomoretz, ASpli: integrative analysis of splicing landscapes through RNA-Seq assays, Bioinformatics, Volume 37, Issue 17, September 2021, Pages 2609–2616, https://doi.org/10.1093/bioinformatics/btab141
saseR example data, obtained from the ASpli package (https://bioconductor.org/packages/release/bioc/html/ASpli.html, 10.1093/bioinformatics/btab141).
data(saseRExample)
data(saseRExample)
Loads all data needed to run examples
saseRfindEncodingDim is used to search for the optimal number of latent factors included in the regression to search for aberrant expression or splicing. This optimal number of latent factors can be estimated by using the Gavish and Donoho threshold for singular values, or by using a denoising autoencoder, as described in our corresponding paper
saseRfindEncodingDim( se, method = "GD", analysis, dimensions = seq(2, min(100, ncol(se) - 2, nrow(se) - 1), 2), freq = 0.01, zScore = 3, sdlog = log(1.6), lnorm = TRUE, inj = "both", BPPARAM = bpparam(), aggregation, scale = TRUE, ... )
saseRfindEncodingDim( se, method = "GD", analysis, dimensions = seq(2, min(100, ncol(se) - 2, nrow(se) - 1), 2), freq = 0.01, zScore = 3, sdlog = log(1.6), lnorm = TRUE, inj = "both", BPPARAM = bpparam(), aggregation, scale = TRUE, ... )
se |
A 'SummarizedExperiment' instance generated with the SummarizedExperiment function of the SummarizedExperiment package. In the assay slot, provide the expression counts as an ordinary 'matrix', 'DataFrame', a 'sparseMatrix' or a 'DelayedMatrix'. 'colData' is a 'DataFrame' describing the samples in the experiment. Also, include 'offsets' in the assays slot, which can be calculated with the 'calculateOffsets' function. Finally, specify the experimental 'design' as a formula in the metadata slot. This formula must be based on the colData, and should be '~1' if no known covariates are included. |
method |
The method used to estimate the optimal number of latent factors included in the regression framework. Default is 'GD', which uses the Gavish and Donoho threshold for singular values. 'DAE' will use a Denoising autoencoder, but will require longer computation time without increased performance. |
analysis |
'AE' for aberrant expression analysis and 'AS' for aberrant splicing analysis. Used to insert corrupted counts when using the 'DAE' method. |
dimensions |
A vector containing the number of latent factors that are compared in the grid search for the optimal number of latent factors. Only used when using the 'DAE' method. |
freq |
the frequency of corrupted counts injected in the dateset. Only used when using the 'DAE' method. |
zScore |
The mean magnitude of the corrupted counts in the dataset. Only used when using the 'DAE' method. |
sdlog |
Standard deviation of the distribution of the corrupted counts on the log-scale. Only used when using the 'DAE' method and when lnorm is TRUE. |
lnorm |
If TRUE, the corrupted counts are simulated from a log-normal distribution with a mean of log(zScore) and standard deviation of sdlog. Only used when using the 'DAE' method. |
inj |
Injection of overexpression ('high'), underexpression ('low') or both ('both') types of corrupted counts. Only used when using the 'DAE' method. |
BPPARAM |
object of class |
aggregation |
character vector representing the column in the rowData to be used to calculate offsets when injecting corrupted counts according to aberrant splicing. Only used when method is 'DAE' and when analysis is 'AE'. |
scale |
boolean. If TRUE, the deviance residuals upon will be scaled to mean 0 and sd = 1 before estimating the latent factors. |
... |
Extra arguments for .fitRUV. |
An updated 'SummarizedExperiment' instance, now including: 'optimalEncDim' in the metadata slot, representing the estimated optimal number of latent factors. 'LatentFactorControl' in the metadata slot, which represents the method used to estimate the optimal number of latent factors ('GD' for Gavish and Donoho threshold, 'DAE' for denoising autoencoder). 'deviances' in the assays slot when using the 'GD' method, representing the deviance residuals calculated using edgeR with an intercept and known covariates. 'svd_u', 'svd_d' and 'svd_v' matrices in the metadata slot when using the 'GD' method, which represent the singular value decomposition of the deviance residuals. 'encDimTable' a data.table in the metadata slot which represents the area under the curve to search for corrupted counts at the different dimensions when using the 'DAE' method.
data(saseRExample, package = "saseR") SEgenes <- saseRfindEncodingDim(SEgenes, method = "GD") SEbins <- saseRfindEncodingDim(SEbins, method = "GD") SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD")
data(saseRExample, package = "saseR") SEgenes <- saseRfindEncodingDim(SEgenes, method = "GD") SEbins <- saseRfindEncodingDim(SEbins, method = "GD") SEjunctions <- saseRfindEncodingDim(SEjunctions, method = "GD")
saseRfit is used to perform a negative binomial regression and to search for aberrant expression or splicing in the context of rare Mendelian disorders. To model aberrant splicing, it uses adapted offsets to model proportions. It uses a predefined or an estimated optimal number of latent factors in the regression. It returns the p-values for the prioritisation of aberrant expression or splicing.
saseRfit( se, analysis, dimensions = NULL, padjust = "BH", BPPARAM = bpparam(), fit = "fast", scale = TRUE, robustPCA = FALSE, ignore_samples = NULL )
saseRfit( se, analysis, dimensions = NULL, padjust = "BH", BPPARAM = bpparam(), fit = "fast", scale = TRUE, robustPCA = FALSE, ignore_samples = NULL )
se |
A 'SummarizedExperiment' instance generated with the SummarizedExperiment function of the SummarizedExperiment package. In the assay slot, provide the expression counts as an ordinary 'matrix', 'DataFrame', a 'sparseMatrix' or a 'DelayedMatrix'. 'colData' is a 'DataFrame' describing the samples in the experiment. Finally, specify the experimental design as a formula in the metadata slot. This formula must be based on the colData, or equal to ~1 in case of modelling only an intercept. If 'optimalEncDim' is available in the metadata slot, this number will be used as the number of latent factors. |
analysis |
'AE' for aberrant expression analysis, 'AS' for aberrant splicing analysis, and 'proportion' for any other kind of analysis using proportions. |
dimensions |
A single integer representing the number of latent factors that are used in the regression framework. If not specified, the 'optimalEncDim' in the metadata slot will be used. If nor 'optimalEncDim' is specified, the Gavish and Donoho threshold for singular values will be used. |
padjust |
the method used to compute FDR-adjusted p-values. Look at the documentation of the 'p.adjust' package for different options. Both Benjamini-Hochberg ('BH') or Benjamini-Yekutieli ('BY') are recommended. |
BPPARAM |
object of class |
fit |
character value. Both 'fast' or 'edgeR' can be used. 'fast' uses a quadratic mean-variance relationship for fast estimation of the mean when many covariates or latent factors are included, without compromising upon power. 'edgeR' can be used to perform parameter estimation , although not scaling well with many covariates or latent factors. |
scale |
logical value to scale the deviance residuals before performing the singular value decomposition. |
robustPCA |
logical value to perform robust PCA using the 'PcaHubert' function, which decreases the influence of outlier values on the estimated latent factors. |
ignore_samples |
character vector with names of samples that are not used when performing the latent factor estimation and estimation of the regression coefficients. |
An updated 'SummarizedExperiment' instance, now including a matrix of p-values ('pValue') and a matrix of FDR-adjusted p-values ('pValueAdjust') in the assay slot. Also 'pValueAdjustMethod' is available in the metadata slot, stating which method was used to obtain FDR-adjusted pvalues. If analysis is 'AS' or 'proportion', also 'pValuesLocus', an aggregated p-value is provided based on the 'locus' column in the rowData. This is calculated by taking the minimal p-values.
data(saseRExample, package = "saseR") SEgenes <- saseRfit(SEgenes, analysis = "AE", padjust = "BH", fit = "fast") SEbins <- saseRfit(SEbins, analysis = "AS", padjust = "BH", fit = "fast") SEjunctions <- saseRfit(SEjunctions, analysis = "AS", padjust = "BH", fit = "fast")
data(saseRExample, package = "saseR") SEgenes <- saseRfit(SEgenes, analysis = "AE", padjust = "BH", fit = "fast") SEbins <- saseRfit(SEbins, analysis = "AS", padjust = "BH", fit = "fast") SEjunctions <- saseRfit(SEjunctions, analysis = "AS", padjust = "BH", fit = "fast")
SummarizedExperiment containing bin-level counts to perform an example aberrant splicing analysis.
data(saseRExample)
data(saseRExample)
Loads all data needed to run vignette and examples
SummarizedExperiment containing gene-level counts to perform an example aberrant expression analysis.
data(saseRExample)
data(saseRExample)
Loads all data needed to run vignette and examples
SummarizedExperiment containing junction-level counts to perform an example aberrant splicing analysis
data(saseRExample)
data(saseRExample)
Loads all data needed to run vignette and examples