Title: | Differential Expression Analysis for RNA-seq data through a robust variance component test |
---|---|
Description: | Differential Expression Analysis RNA-seq data with variance component score test accounting for data heteroscedasticity through precision weights. Perform both gene-wise and gene set analyses, and can deal with repeated or longitudinal data. Methods are detailed in: i) Agniel D & Hejblum BP (2017) Variance component score test for time-course gene set analysis of longitudinal RNA-seq data, Biostatistics, 18(4):589-604 ; and ii) Gauthier M, Agniel D, Thiébaut R & Hejblum BP (2020) dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate, NAR Genomics and Bioinformatics, 2(4):lqaa093. |
Authors: | Denis Agniel [aut], Boris P. Hejblum [aut, cre] , Marine Gauthier [aut], Mélanie Huchon [ctb] |
Maintainer: | Boris P. Hejblum <[email protected]> |
License: | GPL-2 | file LICENSE |
Version: | 1.19.0 |
Built: | 2025-01-17 04:13:03 UTC |
Source: | https://github.com/bioc/dearseq |
Differential Expression Analysis RNA-seq data with variance component score test accounting for data heteroscedasticity through precision weights. Perform both gene-wise and gene set analyses, and can deal with repeated or longitudinal data. Methods are detailed in: i) Agniel D & Hejblum BP (2017) Variance component score test for time-course gene set analysis of longitudinal RNA-seq data, Biostatistics, 18(4):589-604 ; and ii) Gauthier M, Agniel D, Thiébaut R & Hejblum BP (2020) dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate, NAR Genomics and Bioinformatics, 2(4):lqaa093.
Analysis of RNA-seq data with variance component score test accounting for data heteroscedasticity through precision weights. Performs gene-wise analysis as well as gene set analysis, including for complex experimental designs such as longitudinal data.
Package: | dearseq |
Type: | Package |
Version: | 1.13.3 |
Date: | 2023-06-16 |
License: | GPL-2 |
The two main functions of the dearseq
package are
dear_seq
and dgsa_seq
.
Maintainer: Boris P. Hejblum [email protected] (ORCID)
Authors:
Denis Agniel [email protected]
Marine Gauthier [email protected]
Other contributors:
Mélanie Huchon [email protected] [contributor]
Agniel D & Hejblum BP (2017). Variance component score test for time-course gene set analysis of longitudinal RNA-seq data, Biostatistics, 18(4):589-604. DOI: 10.1093/biostatistics/kxx005. arXiv:1605.02351.
Gauthier M, Agniel D, Thiébaut R & Hejblum BP (2020). dearseq: a variance component score test for RNA-Seq differential analysis that effectively controls the false discovery rate, NAR Genomics and Bioinformatics, 2(4):lqaa093. DOI: 10.1093/nargab/lqaa093. DOI: 10.1101/635714
Useful links:
Report bugs at https://github.com/borishejblum/dearseq/issues
A subsample of the RNA-seq data from Baduel et al. studying Arabidopsis Arenosa physiology.
data(baduel_5gs)
data(baduel_5gs)
3 objects
design
: a design matrix for the 48 measured samples, containing
the following variables:
SampleName
corresponding column names from
expr_norm_corr
Intercept
an intercept variable
Population
a factor identifying the plant population
Age_weeks
numeric age of the plant at sampling time (in weeks)
Replicate
a purely technical variable as replicates are not
from the same individual over weeks. Should not be used in analysis.
Vernalized
a logical variable indicating whether the plant had
undergone vernalization (exposition to cold and short day photoperiods)
Vernalized
a binary variable indicating whether the plant
belonged to the KA population
AgeWeeks_Population
interaction variable between the
AgeWeeks
and Population
variables
AgeWeeks_Vernalized
interaction variable between the
AgeWeeks
and Vernalized
variables
Vernalized_Population
interaction variable between the
Vernalized
and Population
variables
AgeWeeks_Vernalized_Population
interaction variable between the
AgeWeeks
, Vernalized
and Population
variables
baduel_gmt
: a gmt
object containing 5 gene sets of
interest (see GSA.read.gmt
), which is simply a
list
with the 3 following components:
genesets
: a list
of n
gene identifiers vectors
composing eachgene set (each gene set is represented as the vector of the
gene identifiers composing it)
geneset.names
: a vector of length n
containing the gene
set names (i.e. gene sets identifiers)
geneset.descriptions: a vector of length n
containing gene set
descriptions (e.g. textual information on their biological function)
expr_norm_corr
: a numeric matrix containing the normalized batch
corrected expression for the 2454 genes included in either of the 5 gene sets
of interests
http://www.ncbi.nlm.nih.gov/bioproject/PRJNA312410
Baduel P, Arnold B, Weisman CM, Hunter B & Bomblies K (2016). Habitat-Associated Life History and Stress-Tolerance Variation in Arabidopsis Arenosa. Plant Physiology, 171(1):437-51. 10.1104/pp.15.01875.
Agniel D & Hejblum BP (2017). Variance component score test for time-course gene set analysis of longitudinal RNA-seq data, Biostatistics, 18(4):589-604. 10.1093/biostatistics/kxx005. arXiv:1605.02351.
if(interactive()){ data('baduel_5gs') set.seed(54321) KAvsTBG <- dgsa_seq(exprmat=log2(expr_norm_corr+1), covariates=apply(as.matrix(design[, c('Intercept', 'Vernalized', 'AgeWeeks', 'Vernalized_Population', 'AgeWeeks_Population'), drop=FALSE]), 2, as.numeric), variables2test = as.matrix(design[, c('PopulationKA'), drop=FALSE]), genesets=baduel_gmt$genesets[c(3,5)], which_test = 'permutation', which_weights = 'loclin', n_perm=1000, preprocessed = TRUE) set.seed(54321) Cold <- dgsa_seq(exprmat=log2(expr_norm_corr+1), covariates=apply(as.matrix(design[, c('Intercept', 'AgeWeeks', 'PopulationKA', 'AgeWeeks_Population'), drop=FALSE]), 2, as.numeric), variables2test=as.matrix(design[, c('Vernalized', 'Vernalized_Population')]), genesets=baduel_gmt$genesets[c(3,5)], which_test = 'permutation', which_weights = 'loclin', n_perm=1000, preprocessed = TRUE) }
if(interactive()){ data('baduel_5gs') set.seed(54321) KAvsTBG <- dgsa_seq(exprmat=log2(expr_norm_corr+1), covariates=apply(as.matrix(design[, c('Intercept', 'Vernalized', 'AgeWeeks', 'Vernalized_Population', 'AgeWeeks_Population'), drop=FALSE]), 2, as.numeric), variables2test = as.matrix(design[, c('PopulationKA'), drop=FALSE]), genesets=baduel_gmt$genesets[c(3,5)], which_test = 'permutation', which_weights = 'loclin', n_perm=1000, preprocessed = TRUE) set.seed(54321) Cold <- dgsa_seq(exprmat=log2(expr_norm_corr+1), covariates=apply(as.matrix(design[, c('Intercept', 'AgeWeeks', 'PopulationKA', 'AgeWeeks_Population'), drop=FALSE]), 2, as.numeric), variables2test=as.matrix(design[, c('Vernalized', 'Vernalized_Population')]), genesets=baduel_gmt$genesets[c(3,5)], which_test = 'permutation', which_weights = 'loclin', n_perm=1000, preprocessed = TRUE) }
Wrapper function for gene-by-gene association testing of RNA-seq data
dear_seq( exprmat = NULL, object = NULL, covariates = NULL, variables2test, sample_group = NULL, weights_var2test_condi = (which_test != "permutation"), cov_variables2test_eff = NULL, which_test = c("permutation", "asymptotic"), which_weights = c("loclin", "voom", "none"), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, preprocessed = FALSE, gene_based_weights = FALSE, bw = "nrd", kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, padjust_methods = c("BH", "BY", "holm", "hochberg", "hommel", "bonferroni"), lowess_span = 0.5, R = NULL, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm_dearseq = TRUE )
dear_seq( exprmat = NULL, object = NULL, covariates = NULL, variables2test, sample_group = NULL, weights_var2test_condi = (which_test != "permutation"), cov_variables2test_eff = NULL, which_test = c("permutation", "asymptotic"), which_weights = c("loclin", "voom", "none"), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, preprocessed = FALSE, gene_based_weights = FALSE, bw = "nrd", kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, padjust_methods = c("BH", "BY", "holm", "hochberg", "hommel", "bonferroni"), lowess_span = 0.5, R = NULL, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm_dearseq = TRUE )
exprmat |
a numeric matrix of size |
object |
an object that can be either a
|
covariates |
If |
variables2test |
|
sample_group |
a vector of length |
weights_var2test_condi |
a logical flag indicating whether
heteroscedasticity weights computation should be conditional on both the
variables to be tested |
cov_variables2test_eff |
a matrix of size |
which_test |
a character string indicating which method to use to
approximate the variance component score test, either |
which_weights |
a character string indicating which method to use to
estimate the mean-variance relationship weights. Possibilities are
|
n_perm |
the number of perturbations. Default is |
progressbar |
logical indicating wether a progressBar should be displayed when computing permutations (only in interactive mode). |
parallel_comp |
a logical flag indicating whether parallel computation
should be enabled. Only Linux and MacOS are supported, this is ignored on
Windows. Default is |
nb_cores |
an integer indicating the number of cores to be used when
|
preprocessed |
a logical flag indicating whether the expression data have
already been preprocessed (e.g. log2 transformed). Default is |
gene_based_weights |
a logical flag used for |
bw |
a character string indicating the smoothing bandwidth selection
method to use. See |
kernel |
a character string indicating which kernel should be used.
Possibilities are |
transform |
a logical flag used for |
padjust_methods |
multiple testing correction method used if
|
lowess_span |
smoother span for the lowess function, between 0 and 1.
This gives the proportion of points in the plot which influence the smooth at
each value. Larger values give more smoothness. Only used if
|
R |
library.size (optional, important to provide if
|
adaptive |
a logical flag indicating whether adaptive permutation should
be performed. Default is |
max_adaptive |
The maximum number of permutations considered.
Default is |
homogen_traj |
a logical flag indicating whether trajectories should be
considered homogeneous. Default is |
na.rm_dearseq |
logical: should missing values in |
A list with the following elements:
which_test
: a character string carrying forward the value of
the 'which_test
' argument indicating which test was perform (either
'asymptotic' or 'permutation').
preprocessed
: a logical flag carrying forward the value of the
'preprocessed
' argument indicating whether the expression data were
already preprocessed, or were provided as raw counts and transformed into
log-counts per million.
n_perm
: an integer carrying forward the value of the
'n_perm
' argument indicating the number of perturbations performed
(NA
if asymptotic test was performed).
genesets
: carrying forward the value of the 'genesets
'
argument defining the gene sets of interest (NULL
for gene-wise
testing).
pval
: computed p-values. A data.frame
with one raw for
each each gene set, or for each gene if genesets
argument is
NULL
, and with 2 columns: the first one 'rawPval
' contains
the raw p-values, the second one contains the FDR adjusted p-values
(according to the 'padjust_methods
' argument) and is named
'adjPval
'.
Gauthier M, Agniel D, Thiébaut R & Hejblum BP (2020). dearseq: a variance component score test for RNA-Seq differential analysis that effectivelycontrols the false discovery rate, NAR Genomics and Bioinformatics, 2(4):lqaa093. DOI: 10.1093/nargab/lqaa093. DOI: 10.1101/635714
sp_weights
vc_test_perm
vc_test_asym
p.adjust
#Monte-Carlo estimation of the proportion of DE genes over `nsims` #simulations under the null #number of runs nsims <- 2 #100 res <- numeric(nsims) for(i in 1:nsims){ n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 b0 <- 1 #under the null: b1 <- 0 #create the matrix of gene expression y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #no covariates x <- matrix(1, ncol=1, nrow=r) #run test #asymptotic test with preprocessed grouped data res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test='asymptotic', which_weights='none', preprocessed=TRUE) #proportion of raw p-values>0.05 mean(res_genes$pvals[, 'rawPval']>0.05) #quantiles of raw p-values quantile(res_genes$pvals[, 'rawPval']) #proportion of raw p-values<0.05 i.e. proportion of DE genes res[i] <- mean(res_genes$pvals[, 'rawPval']<0.05) message(i) } #results mean(res) if(interactive()){ b0 <- 1 #under the null: b1 <- 0 #create the matrix of gene expression y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #run test #asymptotic test with preprocessed grouped data res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_weights='none', preprocessed=TRUE) #results summary(res_genes$pvals) }
#Monte-Carlo estimation of the proportion of DE genes over `nsims` #simulations under the null #number of runs nsims <- 2 #100 res <- numeric(nsims) for(i in 1:nsims){ n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 b0 <- 1 #under the null: b1 <- 0 #create the matrix of gene expression y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #no covariates x <- matrix(1, ncol=1, nrow=r) #run test #asymptotic test with preprocessed grouped data res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test='asymptotic', which_weights='none', preprocessed=TRUE) #proportion of raw p-values>0.05 mean(res_genes$pvals[, 'rawPval']>0.05) #quantiles of raw p-values quantile(res_genes$pvals[, 'rawPval']) #proportion of raw p-values<0.05 i.e. proportion of DE genes res[i] <- mean(res_genes$pvals[, 'rawPval']<0.05) message(i) } #results mean(res) if(interactive()){ b0 <- 1 #under the null: b1 <- 0 #create the matrix of gene expression y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #run test #asymptotic test with preprocessed grouped data res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_weights='none', preprocessed=TRUE) #results summary(res_genes$pvals) }
Wrapper function for performing gene set analysis of (potentially longitudinal) RNA-seq data
dgsa_seq( exprmat = NULL, object = NULL, covariates = NULL, variables2test, weights_var2test_condi = (which_test != "permutation"), genesets, sample_group = NULL, cov_variables2test_eff = NULL, which_test = c("permutation", "asymptotic"), which_weights = c("loclin", "voom", "none"), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, preprocessed = FALSE, gene_based_weights = TRUE, bw = "nrd", kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, padjust_methods = c("BH", "BY", "holm", "hochberg", "hommel", "bonferroni"), lowess_span = 0.5, R = NULL, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm_gsaseq = TRUE, verbose = TRUE )
dgsa_seq( exprmat = NULL, object = NULL, covariates = NULL, variables2test, weights_var2test_condi = (which_test != "permutation"), genesets, sample_group = NULL, cov_variables2test_eff = NULL, which_test = c("permutation", "asymptotic"), which_weights = c("loclin", "voom", "none"), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, preprocessed = FALSE, gene_based_weights = TRUE, bw = "nrd", kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, padjust_methods = c("BH", "BY", "holm", "hochberg", "hommel", "bonferroni"), lowess_span = 0.5, R = NULL, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm_gsaseq = TRUE, verbose = TRUE )
exprmat |
a numeric matrix of size |
object |
an object that can be either an
|
covariates |
If |
variables2test |
|
weights_var2test_condi |
a logical flag indicating whether
heteroscedasticity weights computation should be conditional on both the
variable(s) to be tested |
genesets |
Can be either:
Can be a vector of index or subscripts that defines which
rows of Can also be a Finally, can also be a If |
sample_group |
a vector of length |
cov_variables2test_eff |
a matrix of size |
which_test |
a character string indicating which method to use to
approximate the variance component score test, either |
which_weights |
a character string indicating which method to use to
estimate the mean-variance relationship weights. Possibilities are
|
n_perm |
the number of perturbations. Default is |
progressbar |
logical indicating wether a progressBar should be displayed when computing permutations (only in interactive mode). |
parallel_comp |
a logical flag indicating whether parallel computation
should be enabled. Only Linux and MacOS are supported, this is ignored on
Windows. Default is |
nb_cores |
an integer indicating the number of cores to be used when
|
preprocessed |
a logical flag indicating whether the expression data have
already been preprocessed (e.g. log2 transformed). Default is |
gene_based_weights |
a logical flag used for |
bw |
a character string indicating the smoothing bandwidth selection
method to use. See |
kernel |
a character string indicating which kernel should be used.
Possibilities are |
transform |
a logical flag used for |
padjust_methods |
multiple testing correction method used if
|
lowess_span |
smoother span for the lowess function, between 0 and 1.
This gives the proportion of points in the plot which influence the smooth at
each value. Larger values give more smoothness. Only used if
|
R |
library size (optional, important to provide if
|
adaptive |
a logical flag indicating whether adaptive permutation should
be performed. Default is |
max_adaptive |
The maximum number of permutations considered.
Default is |
homogen_traj |
a logical flag indicating whether trajectories should be
considered homogeneous. Default is |
na.rm_gsaseq |
logical: should missing values in |
verbose |
logical: should informative messages be printed during the
computation? Default is |
A list with the following elements:
which_test
: a character string carrying forward the value of
the 'which_test
' argument indicating which test was perform (either
'asymptotic' or 'permutation').
preprocessed
: a logical flag carrying forward the value of the
'preprocessed
' argument indicating whether the expression data were
already preprocessed, or were provided as raw counts and transformed into
log-counts per million.
n_perm
: an integer carrying forward the value of the
'n_perm
' argument indicating the number of perturbations performed
(NA
if asymptotic test was performed).
genesets
: carrying forward the value of the 'genesets
'
argument defining the gene sets of interest (NULL
for gene-wise t
esting).
pval
: computed p-values. A data.frame
with one raw for
each each gene set, or for each gene if genesets
argument is
NULL
, and with 2 columns: the first one 'rawPval
' contains
the raw p-values, the second one contains the FDR adjusted p-values
(according to the 'padjust_methods
' argument) and is named
'adjPval
'.
Agniel D & Hejblum BP (2017). Variance component score test for time-course gene set analysis of longitudinal RNA-seq data, Biostatistics, 18(4):589-604. 10.1093/biostatistics/kxx005. arXiv:1605.02351.
Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2), R29.
sp_weights
vc_test_perm
vc_test_asym
p.adjust
nsims <- 2 #100 res_quant <- list() for(i in 1:2){ n <- 2000#0 nr <- 3 r <- nr*20 #4*nr#100*nr t <- matrix(rep(1:nr), r/nr, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test res <- dgsa_seq(exprmat = y, covariates = x, variables2test = t, genesets=lapply(0:9, function(x){x*10+(1:10)}), cov_variables2test_eff = matrix(1), sample_group = rep(1:(r/nr), each=nr), which_test='asymptotic', which_weights='none', preprocessed=TRUE) res_genes <- dgsa_seq(exprmat = y, covariates = x, variables2test = cbind(t),#, rnorm(r)), #t^2 genesets = NULL, cov_variables2test_eff = diag(1), sample_group = rep(1:(r/nr), each=nr), which_test = 'asymptotic', which_weights = 'none', preprocessed = TRUE) length(res_genes$pvals[, 'rawPval']) quantile(res_genes$pvals[, 'rawPval']) res_quant[[i]] <- res_genes$pvals[, 'rawPval'] } #round(rowMeans(vapply(res_quant, FUN = quantile, FUN.VALUE = rep(1.1, 5))), 3) #plot(density(unlist(res_quant))) #mean(unlist(res_quant)<0.05) if(interactive()){ res_genes <- dgsa_seq(exprmat = y, covariates = x, variables2test = t, genesets = NULL, cov_variables2test_eff = matrix(1), sample_group = rep(1:(r/nr), each=nr), which_test = 'permutation', which_weights = 'none', preprocessed = TRUE, n_perm = 1000, parallel_comp = FALSE) mean(res_genes$pvals$rawPval < 0.05) summary(res_genes$pvals$adjPval) }
nsims <- 2 #100 res_quant <- list() for(i in 1:2){ n <- 2000#0 nr <- 3 r <- nr*20 #4*nr#100*nr t <- matrix(rep(1:nr), r/nr, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test res <- dgsa_seq(exprmat = y, covariates = x, variables2test = t, genesets=lapply(0:9, function(x){x*10+(1:10)}), cov_variables2test_eff = matrix(1), sample_group = rep(1:(r/nr), each=nr), which_test='asymptotic', which_weights='none', preprocessed=TRUE) res_genes <- dgsa_seq(exprmat = y, covariates = x, variables2test = cbind(t),#, rnorm(r)), #t^2 genesets = NULL, cov_variables2test_eff = diag(1), sample_group = rep(1:(r/nr), each=nr), which_test = 'asymptotic', which_weights = 'none', preprocessed = TRUE) length(res_genes$pvals[, 'rawPval']) quantile(res_genes$pvals[, 'rawPval']) res_quant[[i]] <- res_genes$pvals[, 'rawPval'] } #round(rowMeans(vapply(res_quant, FUN = quantile, FUN.VALUE = rep(1.1, 5))), 3) #plot(density(unlist(res_quant))) #mean(unlist(res_quant)<0.05) if(interactive()){ res_genes <- dgsa_seq(exprmat = y, covariates = x, variables2test = t, genesets = NULL, cov_variables2test_eff = matrix(1), sample_group = rep(1:(r/nr), each=nr), which_test = 'permutation', which_weights = 'none', preprocessed = TRUE, n_perm = 1000, parallel_comp = FALSE) mean(res_genes$pvals$rawPval < 0.05) summary(res_genes$pvals$adjPval) }
9 Pathogenesis Based Transcripts (PBT) gene sets specifically related to kidney transplant
data(PBT_gmt)
data(PBT_gmt)
a gmt
object containing 9 gene sets specific to kidney
transplant (see GSA.read.gmt
), which is simply a
list
with the 3 following components:
genesets
: a list
of n
gene identifiers vectors
composing eachgene set (each gene set is represented as the vector of the
gene identifiers composing it)
geneset.names
: a vector of length n
containing the gene
set names (i.e. gene sets identifiers)
geneset.descriptions: a vector of length n
containing gene set
descriptions (e.g. textual information on their biological function)
http://atagc.med.ualberta.ca/Research/GeneLists
Halloran PF, De Freitas DG, Einecke G, et al., The molecular phenotype of kidney transplants: Personal viewpoint, Am J Transplant, 10: 2215-2222, 2010. .
Sellares J, Reeve J, Loupy A, et al., Molecular diagnosis of antibody-mediated rejection in human kidney transplants, Am J Transplant, 13:971-983, 2013.
Broin PO, Hayde N, Bao Y, et al., A pathogenesis-based transcript signature in donor-specific antibody-positive kidney transplant patients with normal biopsies, Genomics Data 2: 357-60, 2014.
data('PBT_gmt') PBT_gmt
data('PBT_gmt') PBT_gmt
Calculates exact p-values for permutation tests when permutations are randomly
drawn with replacement. This implementation is based on
(slightly adapted) the implementation by Belinda Phipson and Gordon Smyth
from the R package statmod
permPvals(nPermSupObs, nPermEff, totalPossibleNPerm)
permPvals(nPermSupObs, nPermEff, totalPossibleNPerm)
nPermSupObs |
number of permutations that yielded test statistics at least as extreme as the observed data. Can be a vector or an array of values. |
nPermEff |
number of permutations effectively computed. |
totalPossibleNPerm |
total number of permutations possible. |
a vector (or an array, similar to nperm_supobs
)
of exact p-values
Belinda Phipson and Gordon Smyth (adapted by Boris Hejblum)
Phipson B, and Smyth GK (2010). Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, Volume 9, Issue 1, Article 39. http://www.statsci.org/smyth/pubs/PermPValuesPreprint.pdf
statmod::permp
permPvals(10, 100, 1000)
permPvals(10, 100, 1000)
Display the histogram of raw p-values for diagnostic plots
plot_hist_pvals(pvals, binwidth = 0.02)
plot_hist_pvals(pvals, binwidth = 0.02)
pvals |
a vector of raw p-values |
binwidth |
a value specifying the width of the histogram bins.
Default is |
a ggplot
object
Boris Hejblum
#generate fake data n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 x <- matrix(1, ncol=1, nrow=r) #no covariates only intercept y.tilde <- rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #Run dear_seq() res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test = "asymptotic", which_weights='none', preprocessed=TRUE) #Plot plot_hist_pvals(res_genes$pvals$rawPval)
#generate fake data n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 x <- matrix(1, ncol=1, nrow=r) #no covariates only intercept y.tilde <- rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #Run dear_seq() res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test = "asymptotic", which_weights='none', preprocessed=TRUE) #Plot plot_hist_pvals(res_genes$pvals$rawPval)
This function prints the sorted exact p-values along with the Benjamini-Hochberg limit and the 5
plot_ord_pvals(pvals, signif_threshold = 0.05)
plot_ord_pvals(pvals, signif_threshold = 0.05)
pvals |
a vector of length |
signif_threshold |
a value between |
a plot of sorted gene-wise p-values
#generate fake data n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 x <- matrix(1, ncol=1, nrow=r) #no covariates only intercept y.tilde <- rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #Run dear_seq() res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test = "asymptotic", which_weights='none', preprocessed=TRUE) #Plot plot_ord_pvals(res_genes$pvals$rawPval)
#generate fake data n <- 1000 #number of genes nr=5 #number of measurements per subject (grouped data) ni=50 #number of subjects r <- nr*ni #number of measurements t <- matrix(rep(1:nr), ni, ncol=1, nrow=r) # the variable to be tested sigma <- 0.5 x <- matrix(1, ncol=1, nrow=r) #no covariates only intercept y.tilde <- rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) #Run dear_seq() res_genes <- dear_seq(exprmat=y, covariates=x, variables2test=t, sample_group=rep(1:ni, each=nr), which_test = "asymptotic", which_weights='none', preprocessed=TRUE) #Plot plot_ord_pvals(res_genes$pvals$rawPval)
Display the variability with respect to the level of expression and the associated smoothed estimation of precision weights accounting for heteroscedasticity.
plot_weights(x)
plot_weights(x)
x |
a list (such as outputed by the functions
|
a ggplot
object
Boris Hejblum
G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN = function(x){rnbinom(n = G, size = 0.07, mu = 200)}) x <- sapply(1:p, FUN = function(x){rnorm(n = n, mean = n, sd = 1)}) if(interactive()){ w <- sp_weights(y, x, use_phi=FALSE, na.rm = TRUE, gene_based=TRUE) plot_weights(w) vw <- voom_weights(y, x) plot_weights(vw) }
G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN = function(x){rnbinom(n = G, size = 0.07, mu = 200)}) x <- sapply(1:p, FUN = function(x){rnorm(n = n, mean = n, sd = 1)}) if(interactive()){ w <- sp_weights(y, x, use_phi=FALSE, na.rm = TRUE, gene_based=TRUE) plot_weights(w) vw <- voom_weights(y, x) plot_weights(vw) }
Plot method for dearseq objects
## S3 method for class 'dearseq' plot(x, signif_threshold = 0.05, ...)
## S3 method for class 'dearseq' plot(x, signif_threshold = 0.05, ...)
x |
an object of class |
signif_threshold |
a value between |
... |
further arguments |
a ggplot
object
Boris Hejblum
Computes precision weights that account for heteroscedasticity in RNA-seq count data based on non-parametric local linear regression estimates.
sp_weights( y, x, phi = NULL, use_phi = TRUE, preprocessed = FALSE, gene_based = FALSE, bw = c("nrd", "ucv", "SJ", "nrd0", "bcv"), kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, verbose = TRUE, na.rm = FALSE )
sp_weights( y, x, phi = NULL, use_phi = TRUE, preprocessed = FALSE, gene_based = FALSE, bw = c("nrd", "ucv", "SJ", "nrd0", "bcv"), kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight", "tricube", "cosine", "optcosine"), transform = TRUE, verbose = TRUE, na.rm = FALSE )
y |
a numeric matrix of size |
x |
a numeric matrix of size |
phi |
a numeric design matrix of size |
use_phi |
a logical flag indicating whether conditional means should be
conditioned on |
preprocessed |
a logical flag indicating whether the expression data have
already been preprocessed (e.g. log2 transformed). Default is |
gene_based |
a logical flag indicating whether to estimate weights at the
gene-level. Default is |
bw |
a character string indicating the smoothing bandwidth selection
method to use. See |
kernel |
a character string indicating which kernel should be used.
Possibilities are |
transform |
a logical flag indicating whether values should be
transformed to uniform for the purpose of local linear smoothing. This may be
helpful if tail observations are sparse and the specified bandwidth gives
suboptimal performance there. Default is |
verbose |
a logical flag indicating whether informative messages are
printed during the computation. Default is |
na.rm |
logical: should missing values (including |
a list containing the following components:
weights
: a matrix n x G
containing the computed precision
weights
plot_utilities
: a list containing the following elements:
reverse_trans
: a function encoding the reverse function used
for smoothing the observations before computing the weights
method
: the weight computation method ("loclin"
)
smth
: the vector of the smoothed values computed
gene_based
: a logical indicating whether the computed
weights are based on average at the gene level or on individual
observations
mu
: the transformed observed counts or averages
v
: the observed variability estimates
Boris Hejblum
set.seed(123) G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN = function(x){rnbinom(n = G, size = 0.07, mu = 200)}) x <- sapply(1:p, FUN = function(x){rnorm(n = n, mean = n, sd = 1)}) w <- sp_weights(y, x, use_phi=FALSE, na.rm = TRUE)
set.seed(123) G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN = function(x){rnbinom(n = G, size = 0.07, mu = 200)}) x <- sapply(1:p, FUN = function(x){rnorm(n = n, mean = n, sd = 1)}) w <- sp_weights(y, x, use_phi=FALSE, na.rm = TRUE)
Spaghetti plot for Specific Gene Set
spaghettiPlot1GS( gs_index, gmt, expr_mat, design, var_time, var_indiv, sampleIdColname, var_group = NULL, var_subgroup = NULL, plotChoice = c("Medians", "Individual"), loess_span = 0.75 )
spaghettiPlot1GS( gs_index, gmt, expr_mat, design, var_time, var_indiv, sampleIdColname, var_group = NULL, var_subgroup = NULL, plotChoice = c("Medians", "Individual"), loess_span = 0.75 )
gs_index |
index of the specific gene set in |
gmt |
a |
expr_mat |
a |
design |
a |
var_time |
the |
var_indiv |
the patient variable contained in |
sampleIdColname |
a character string indicating the name of the sample ID
variable in |
var_group |
a group variable in |
var_subgroup |
a subgroup variable in |
plotChoice |
to choose which type of plot (either |
loess_span |
smoothing span. Default is |
a ggplot2 plot object
data(baduel_5gs) design$Indiv <- design$Population:design$Replicate design$Vern <- ifelse(design$Vernalized, "Vernalized", "Non-vernalized") library(ggplot2) spaghettiPlot1GS(gs_index = 3, gmt = baduel_gmt, expr_mat = log2(expr_norm_corr+1), design = design, var_time = AgeWeeks, var_indiv = Indiv, sampleIdColname = "sample", var_group=Vern, var_subgroup=Population, plotChoice = "Medians", loess_span= 1.5) + xlab("Age (weeks)") + guides(color= "none")
data(baduel_5gs) design$Indiv <- design$Population:design$Replicate design$Vern <- ifelse(design$Vernalized, "Vernalized", "Non-vernalized") library(ggplot2) spaghettiPlot1GS(gs_index = 3, gmt = baduel_gmt, expr_mat = log2(expr_norm_corr+1), design = design, var_time = AgeWeeks, var_indiv = Indiv, sampleIdColname = "sample", var_group=Vern, var_subgroup=Population, plotChoice = "Medians", loess_span= 1.5) + xlab("Age (weeks)") + guides(color= "none")
Summary method for dearseq objects
## S3 method for class 'dearseq' summary(object, signif_threshold = 0.05, ...) ## S3 method for class 'summary.dearseq' print(x, ...)
## S3 method for class 'dearseq' summary(object, signif_threshold = 0.05, ...) ## S3 method for class 'summary.dearseq' print(x, ...)
object |
an object of class |
signif_threshold |
a value between |
... |
further arguments |
x |
an object of class ' |
a list
Boris Hejblum
This function computes an approximation of the variance component test based
on the asymptotic distribution of a mixture of s using the saddlepoint
method from
pchisqsum
, as per Chen & Lumley 20219 CSDA.
vc_test_asym( y, x, indiv = rep(1, nrow(x)), phi, w, Sigma_xi = diag(ncol(phi)), genewise_pvals = FALSE, homogen_traj = FALSE, na.rm = FALSE )
vc_test_asym( y, x, indiv = rep(1, nrow(x)), phi, w, Sigma_xi = diag(ncol(phi)), genewise_pvals = FALSE, homogen_traj = FALSE, na.rm = FALSE )
y |
a numeric matrix of dim |
x |
a numeric design matrix of dim |
indiv |
a vector of length |
phi |
a numeric design matrix of size |
w |
a vector of length |
Sigma_xi |
a matrix of size |
genewise_pvals |
a logical flag indicating whether gene-wise p-values
should be returned. Default is |
homogen_traj |
a logical flag indicating whether trajectories should be
considered homogeneous. Default is |
na.rm |
logical: should missing values (including |
A list with the following elements when the set p-value is computed:
set_score_obs
: the approximation of the observed set score
set_pval
: the associated set p-value
or a list with the following elements when gene-wise p-values are computed:
gene_scores_obs
: vector of approximating the observed
gene-wise scores
gene_pvals
: vector of associated gene-wise p-values
Chen T & Lumley T (2019), Numerical evaluation of methods approximating the distribution of a large quadratic form in normal variables, Computational Statistics & Data Analysis, 139:75-81.
set.seed(123) ##generate some fake data ######################## n <- 100 r <- 12 t <- matrix(rep(1:(r/4)), 4, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 #under the alternative: #b1 <- 0.5 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test asymTestRes <- vc_test_asym(y, x, phi=cbind(t, t^2), w=matrix(1, ncol=ncol(y), nrow=nrow(y)), Sigma_xi=diag(2), indiv=1:r, genewise_pvals=TRUE) plot(density(asymTestRes$gene_pvals)) quantile(asymTestRes$gene_pvals)
set.seed(123) ##generate some fake data ######################## n <- 100 r <- 12 t <- matrix(rep(1:(r/4)), 4, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 #under the alternative: #b1 <- 0.5 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test asymTestRes <- vc_test_asym(y, x, phi=cbind(t, t^2), w=matrix(1, ncol=ncol(y), nrow=nrow(y)), Sigma_xi=diag(2), indiv=1:r, genewise_pvals=TRUE) plot(density(asymTestRes$gene_pvals)) quantile(asymTestRes$gene_pvals)
This function computes an approximation of the Variance Component test for a
mixture of s using permutations. This is preferable to the
asymptotic approximation for small sample sizes. We rely on exact p-values
following Phipson and Smyth, 2010 (see References).
vc_test_perm( y, x, indiv = rep(1, nrow(x)), phi, w, Sigma_xi = diag(ncol(phi)), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, genewise_pvals = FALSE, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm = FALSE )
vc_test_perm( y, x, indiv = rep(1, nrow(x)), phi, w, Sigma_xi = diag(ncol(phi)), n_perm = 1000, progressbar = TRUE, parallel_comp = TRUE, nb_cores = parallel::detectCores(logical = FALSE) - 1, genewise_pvals = FALSE, adaptive = TRUE, max_adaptive = 64000, homogen_traj = FALSE, na.rm = FALSE )
y |
a numeric matrix of dim |
x |
a numeric design matrix of dim |
indiv |
a vector of length |
phi |
a numeric design matrix of size |
w |
a vector of length |
Sigma_xi |
a matrix of size |
n_perm |
the number of perturbations. Default is |
progressbar |
logical indicating wether a progressBar should be displayed when computing permutations (only in interactive mode). |
parallel_comp |
a logical flag indicating whether parallel computation
should be enabled. Only Linux and MacOS are supported, this is ignored on
Windows. Default is |
nb_cores |
an integer indicating the number of cores to be used when
|
genewise_pvals |
a logical flag indicating whether gene-wise p-values
should be returned. Default is |
adaptive |
a logical flag indicating whether adaptive permutation should
be performed. Default is |
max_adaptive |
The maximum number of permutations considered.
Default is |
homogen_traj |
a logical flag indicating whether trajectories should be
considered homogeneous. Default is |
na.rm |
logical: should missing values (including |
A list with the following elements when the set p-value is computed:
set_score_obs
: the approximation of the observed set score
set_pval
: the associated set p-value
or a list with the following elements when gene-wise p-values are computed:
gene_scores_obs
: vector of approximating the observed
gene-wise scores
gene_pvals
: vector of associated gene-wise p-values
ds_fdr
: vector of associated gene-wise discrete false
discovery rates
Phipson B, and Smyth GK (2010). Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn. Statistical Applications in Genetics and Molecular Biology, Volume 9, Issue 1, Article 39. http://www.statsci.org/smyth/pubs/PermPValuesPreprint.pdf
set.seed(123) ##generate some fake data ######################## n <- 100 r <- 12 t <- matrix(rep(1:3), 4, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 #under the alternative: b1 <- 0.5 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test permTestRes <- vc_test_perm(y, x, phi=t, w=matrix(1, ncol=ncol(y), nrow=nrow(y)), indiv=rep(1:4, each=3), n_perm=50, #1000, parallel_comp = FALSE) permTestRes$set_pval
set.seed(123) ##generate some fake data ######################## n <- 100 r <- 12 t <- matrix(rep(1:3), 4, ncol=1, nrow=r) sigma <- 0.4 b0 <- 1 #under the null: b1 <- 0 #under the alternative: b1 <- 0.5 y.tilde <- b0 + b1*t + rnorm(r, sd = sigma) y <- t(matrix(rnorm(n*r, sd = sqrt(sigma*abs(y.tilde))), ncol=n, nrow=r) + matrix(rep(y.tilde, n), ncol=n, nrow=r)) x <- matrix(1, ncol=1, nrow=r) #run test permTestRes <- vc_test_perm(y, x, phi=t, w=matrix(1, ncol=ncol(y), nrow=nrow(y)), indiv=rep(1:4, each=3), n_perm=50, #1000, parallel_comp = FALSE) permTestRes$set_pval
Implementation of the procedure described in Law et al. for estimating precision weights from RNA-seq data.
voom_weights(y, x, preprocessed = FALSE, lowess_span = 0.5, R = NULL)
voom_weights(y, x, preprocessed = FALSE, lowess_span = 0.5, R = NULL)
y |
a matrix of size |
x |
a matrix of size |
preprocessed |
a logical flag indicating whether the expression data have
already been preprocessed (e.g. log2 transformed). Default is |
lowess_span |
smoother span for the lowess function, between 0 and 1.
This gives the proportion of points in the plot which influence the smooth at
each value. Larger values give more smoothness. Default is |
R |
library.size (optional, important to provide if
|
a vector of length n
containing the computed precision weights
Boris Hejblum
Law, C. W., Chen, Y., Shi, W., & Smyth, G. K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2), R29.
set.seed(123) G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN=function(x){rnbinom(n=G, size=0.07, mu=200)}) x <- sapply(1:p, FUN=function(x){rnorm(n=n, mean=n, sd=1)}) my_w <- voom_weights(y, x) plot_weights(my_w) if (requireNamespace('limma', quietly = TRUE)) { w_voom <- limma::voom(counts=y, design=x, plot=TRUE) #slightly faster, same results all.equal(my_w$weights, w_voom$weights) } if(interactive()){ #microbenchmark::microbenchmark(limma::voom(counts=t(y), design=x, # plot=FALSE), voom_weights(x, y), # times=30) }
set.seed(123) G <- 10000 n <- 12 p <- 2 y <- sapply(1:n, FUN=function(x){rnbinom(n=G, size=0.07, mu=200)}) x <- sapply(1:p, FUN=function(x){rnorm(n=n, mean=n, sd=1)}) my_w <- voom_weights(y, x) plot_weights(my_w) if (requireNamespace('limma', quietly = TRUE)) { w_voom <- limma::voom(counts=y, design=x, plot=TRUE) #slightly faster, same results all.equal(my_w$weights, w_voom$weights) } if(interactive()){ #microbenchmark::microbenchmark(limma::voom(counts=t(y), design=x, # plot=FALSE), voom_weights(x, y), # times=30) }