Title: | Splicing Diversity Analysis for Transcriptome Data |
---|---|
Description: | The SplicingFactory R package uses transcript-level expression values to analyze splicing diversity based on various statistical measures, like Shannon entropy or the Gini index. These measures can quantify transcript isoform diversity within samples or between conditions. Additionally, the package analyzes the isoform diversity data, looking for significant changes between conditions. |
Authors: | Peter A. Szikora [aut], Tamas Por [aut], Endre Sebestyen [aut, cre] |
Maintainer: | Endre Sebestyen <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 1.15.0 |
Built: | 2024-11-27 05:11:23 UTC |
Source: | https://github.com/bioc/SplicingFactory |
Calculate splicing diversity changes between two conditions.
calculate_difference( x, samples, control, method = "mean", test = "wilcoxon", randomizations = 100, pcorr = "BH", assayno = 1, verbose = FALSE, ... )
calculate_difference( x, samples, control, method = "mean", test = "wilcoxon", randomizations = 100, pcorr = "BH", assayno = 1, verbose = FALSE, ... )
x |
A |
samples |
A vector of length one, specifying the column name of the
|
control |
Name of the control sample category, defined in the
|
method |
Method to use for calculating the average splicing diversity
value in a condition. Can be |
test |
Method to use for p-value calculation: use |
randomizations |
Number of random shuffles, used for the label shuffling test (default = 100). |
pcorr |
P-value correction method applied to the Wilcoxon rank sum test
or label shuffling test results, as defined in the |
assayno |
An integer value. In case of multiple assays in a
|
verbose |
If |
... |
Further arguments to be passed on for other methods. |
The function calculates diversity changes between two sample
conditions. It uses the output of the diversity calculation function, which
is a SummarizedExperiment
object of splicing diversity values.
Additionally, it can use a data.frame
as input, where the first column
contains gene names, and all additional columns contain splicing diversity
values for each sample. A vector of sample conditions also serves as input,
used for aggregating the samples by condition.
It calculates the mean or median of the splicing diversity data per sample condition, the difference of these values and the log2 fold change of the two conditions. Furthermore, the user can select a statistical method to calculate the significance of the changes. The p-values and adjusted p-values are calculated using a Wilcoxon sum rank test or label shuffling test.
The function will exclude genes of low sample size from the significance calculation, depending on which statistical test is applied.
A data.frame
with the mean or median values of splicing
diversity across sample categories and all samples, log2(fold change) of
the two different conditions, raw and corrected p-values.
# data.frame with splicing diversity values x <- data.frame(Genes = letters[seq_len(10)], matrix(runif(80), ncol = 8)) # sample categories samples <- c(rep('Healthy', 4), rep('Pathogenic', 4)) # To calculate the difference of splicing diversity changes between the # 'Healthy' and 'Pathogenic' condition together with the significance values, # using mean and Wilcoxon rank sum test, use: calculate_difference(x, samples, control = 'Healthy', method = 'mean', test = 'wilcoxon')
# data.frame with splicing diversity values x <- data.frame(Genes = letters[seq_len(10)], matrix(runif(80), ncol = 8)) # sample categories samples <- c(rep('Healthy', 4), rep('Pathogenic', 4)) # To calculate the difference of splicing diversity changes between the # 'Healthy' and 'Pathogenic' condition together with the significance values, # using mean and Wilcoxon rank sum test, use: calculate_difference(x, samples, control = 'Healthy', method = 'mean', test = 'wilcoxon')
Main function for calculating splicing diversity
calculate_diversity( x, genes = NULL, method = "laplace", norm = TRUE, tpm = FALSE, assayno = 1, verbose = FALSE )
calculate_diversity( x, genes = NULL, method = "laplace", norm = TRUE, tpm = FALSE, assayno = 1, verbose = FALSE )
x |
A numeric |
genes |
Character vector with equal length to the number of rows of the
input dataset with transcript-level expression values. The values in
|
method |
Method to use for splicing diversity calculation, including
naive entropy ( |
norm |
If |
tpm |
In the case of a tximport list, TPM values or raw read counts can
serve as an input. If |
assayno |
An integer value. In case of multiple assays in a
|
verbose |
If |
The function is intended to process transcript-level expression data from RNA-seq or similar datasets.
Given a N x M matrix or similar data structure, where the N rows are transcripts and the M columns are samples, and a vector of gene ids, used for aggregating the transcript level data, the function calculates transcript diversity values for each gene in each sample. These diversity values can be used to investigate the dominance of a specific transcript for a gene, the diversity of transcripts in a gene, and analyze changes in diversity.
There are a number of diversity values implemented in the package. These include the following:
Naive entropy: Shannon entropy using the transcript frequencies as probabilities. 0 entropy means a single dominant transcript, higher values mean a more diverse set of transcripts for a gene.
Laplace entropy: Shannon entropy where the transcript frequencies are replaced by a Bayesian estimate, using Laplace's prior.
Gini index: a measure of statistical dispersion originally used in economy. This measurement ranges from 0 (complete equality) to 1 (complete inequality). A value of 1 (complete inequality) means a single dominant transcript.
Simpson index: a measure of diversity, characterizing the number of different species (transcripts of a gene) in a dataset. Originally, this measurement calculates the probability that randomly selected individuals belong to different species. Simpson index ranges between 0 and 1; the higher the value, the higher the diversity.
Inverse Simpson index: Similar concept as the Simpson index, although a higher inverse-Simpson index means greater diversity. It ranges between 1 and the total number of transcripts for a gene.
The function can calculate the gene level diversity index using any kind of expression measure, including raw read counts, FPKM, RPKM or TPM values, although results may vary.
Gene-level splicing diversity values in a SummarizedExperiment
object.
# matrix with RNA-seq read counts x <- matrix(rpois(60, 10), ncol = 6) colnames(x) <- paste0("Sample", 1:6) # gene names used for grouping the transcript level data gene <- c(rep("Gene1", 3), rep("Gene2", 2), rep("Gene3", 3), rep("Gene4", 2)) # calculating normalized Laplace entropy result <- calculate_diversity(x, gene, method = "laplace", norm = TRUE)
# matrix with RNA-seq read counts x <- matrix(rpois(60, 10), ncol = 6) colnames(x) <- paste0("Sample", 1:6) # gene names used for grouping the transcript level data gene <- c(rep("Gene1", 3), rep("Gene2", 2), rep("Gene3", 3), rep("Gene4", 2)) # calculating normalized Laplace entropy result <- calculate_diversity(x, gene, method = "laplace", norm = TRUE)
Calculate entropy for a vector of transcript-level expression values of one gene.
calculate_entropy(x, norm = TRUE, pseudocount = 0)
calculate_entropy(x, norm = TRUE, pseudocount = 0)
x |
Vector of expression values. |
norm |
If |
pseudocount |
Pseudocount added to each transcript expression value. Default is 0, while Laplace entropy uses a pseudocount of 1. |
The function calculates an entropy value as part of different diversity calculations. Given a vector of transcript-level expression values of a gene, this function characterizes the diversity of splicing isoforms for a gene. If there only a single transcript, the diversity value will be NaN, as it cannot be calculated. If the expression of the given gene is 0, the diversity value will be NA.
A single gene-level entropy value.
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate non-normalized naive entropy value entropy <- calculate_entropy(x, norm = FALSE) # calculate Laplace-entropy, also normalized for transcript number # (the default) norm_laplace_entropy <- calculate_entropy(x, pseudocount = 1)
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate non-normalized naive entropy value entropy <- calculate_entropy(x, norm = FALSE) # calculate Laplace-entropy, also normalized for transcript number # (the default) norm_laplace_entropy <- calculate_entropy(x, pseudocount = 1)
Calculate splicing diversity changes between two conditions.
calculate_fc(x, samples, control, method = "mean")
calculate_fc(x, samples, control, method = "mean")
x |
A |
samples |
Character vector with an equal length to the number of columns in the input dataset, specifying the category of each sample. |
control |
Name of the control sample category, defined in the
|
method |
Method to use for calculating the average splicing diversity
value in a condition. Can be |
The function uses a matrix of splicing diversity values in order to calculate mean or median differences and log2 fold changes between two conditions.
A data.frame
with mean or median value of splicing diversity
across sample categories, the difference between these values and the log2
fold change values.
Calculate Gini coefficient for a vector of transcript-level expression values of one gene.
calculate_gini(x)
calculate_gini(x)
x |
Vector of expression values. |
The function calculates a Gini coefficient as part of different diversity calculations. Given a vector of transcript-level expression values of a gene, this function characterize the diversity of splicing isoforms for a gene. If there only one single transcript, the resulted index will be NaN, as diversity cannot be calculated. If the expression of the given gene is 0, the diversity index will be NA.
A single gene-level Gini coefficient.
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate Gini index gini <- calculate_gini(x)
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate Gini index gini <- calculate_gini(x)
#' Calculate inverse Simpson index for a vector of transcript-level expression values of one gene.
calculate_inverse_simpson(x)
calculate_inverse_simpson(x)
x |
Vector of expression values. |
The function calculates an inverse Simpson index as part of different diversity calculations. Given a vector of transcript-level expression values of a gene, this function characterize the diversity of splicing isoforms for a gene. If there only one single transcript, the resulted index will be NaN, as diversity cannot be calculated. If the expression of the given gene is 0, the diversity index will be NA.
A single gene-level inverse Simpson index.
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate inverse Simpson index invsimpson <- calculate_inverse_simpson(x)
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate inverse Simpson index invsimpson <- calculate_inverse_simpson(x)
Calculate diversity values for a matrix of transcripts.
calculate_method(x, genes, method, norm = TRUE, verbose = FALSE)
calculate_method(x, genes, method, norm = TRUE, verbose = FALSE)
x |
An input |
genes |
Character vector with equal length to the number of rows of the
input dataset with transcript-level expression values. The values in
|
method |
Method to use for splicing diversity calculation, including
naive entropy ( |
norm |
If |
verbose |
If |
The function calculates diversity values on a matrix of
transcript-level expression values, aggregated by the genes defined in the
genes
parameter.
Gene-level splicing diversity values in a data.frame
, where
each row belongs to a gene and each column belongs to a sample from the
data, in addition to the first column, containing gene names, given in the
'genes' parameter.
#' Calculate Simpson index for a vector of transcript-level expression values of one gene.
calculate_simpson(x)
calculate_simpson(x)
x |
Vector of expression values. |
The function calculates a Simpson index as part of different diversity calculations. Given a vector of transcript-level expression values of a gene, this function characterize the diversity of splicing isoforms for a gene. If there only one single transcript, the resulted index will be NaN, as diversity cannot be calculated. If the expression of the given gene is 0, the diversity index will be NA.
A single gene-level Simpson index.
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate Simpson index simpson <- calculate_simpson(x)
# read counts for the transcripts of a single gene with 5 transcripts x <- rnbinom(5, size = 10, prob = 0.4) # calculate Simpson index simpson <- calculate_simpson(x)
Calculate p-values using label shuffling.
label_shuffling( x, samples, control, method, randomizations = 100, pcorr = "BH" )
label_shuffling( x, samples, control, method, randomizations = 100, pcorr = "BH" )
x |
A |
samples |
Character vector with an equal length to the number of columns in the input dataset, specifying the category of each sample. |
control |
Name of the control sample category, defined in the
|
method |
Method to use for calculating the average splicing diversity
value in a condition. Can be |
randomizations |
The number of random shuffles. |
pcorr |
P-value correction method applied to the results, as defined in
the |
Raw and corrected p-values.
Data from The Cancer Genome Atlas, downloaded on 08th September, 2020. It contains transcript level read counts of 20 patients with Luminal A type breast cancer (primary tumor and solid normal samples).
data(tcga_brca_luma_dataset)
data(tcga_brca_luma_dataset)
A data frame with 996 rows and 41 columns. The first column contains gene names, all additional columns contain RNA-sequencing read counts for samples.
The Cancer Genome Atlas Network (2012) Nature 490, 61–70 (doi:10.1038/nature11412)
Calculate p-values using Wilcoxon rank sum test.
wilcoxon(x, samples, pcorr = "BH", paired = FALSE, exact = FALSE)
wilcoxon(x, samples, pcorr = "BH", paired = FALSE, exact = FALSE)
x |
A |
samples |
Character vector with an equal length to the number of columns in the input dataset, specifying the category of each sample. |
pcorr |
P-value correction method applied to the results, as defined in
the |
paired |
If |
exact |
If |
Raw and corrected p-values in a matrix.