Title: | Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size |
---|---|
Description: | The purpose of this package is to discover the genes that are differentially expressed between two conditions in RNA-seq experiments. Gene expression is measured in counts of transcripts and modeled with the Negative Binomial (NB) distribution using a shrinkage approach for dispersion estimation. The method of moment (MM) estimates for dispersion are shrunk towards an estimated target, which minimizes the average squared difference between the shrinkage estimates and the initial estimates. The exact per-gene probability under the NB model is calculated, and used to test the hypothesis that the expected expression of a gene in two conditions identically follow a NB distribution. |
Authors: | Danni Yu <[email protected]>, Wolfgang Huber <[email protected]> and Olga Vitek <[email protected]> |
Maintainer: | Danni Yu <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.45.0 |
Built: | 2024-10-31 05:28:29 UTC |
Source: | https://github.com/bioc/sSeq |
This package is to discover the genes that differentially expressed between two conditions based on RNA-seq experiments. Gene expression is measured in counts of transcripts and modeled with the Negative Binomial (NB) distribution using a shrinkage approach for dispersion estimation. The method of moment (MM) estimates for dispersion are simply shrunk toward a target, which minimizes the average squared difference between the shrinkage estimates and the initial estimates. The exact per-gene probability under the NB model is calculated, and used to test the hypothesis that the expected expression of a gene in two conditions are not different.
Package: | sSeq |
Type: | Package |
Version: | 1.0 |
Date: | 2013-02-25 |
Danni Yu <[email protected]>, Wolfgang Huber <[email protected]> and Olga Vitek <[email protected]>
Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size
#load a simulated data that includes a count table data("countsTable") #calculate the p-values using the shrinkage approach. conds <- c("A", "B") resJS <- nbTestSH( countsTable, conds, "A", "B")
#load a simulated data that includes a count table data("countsTable") #calculate the p-values using the shrinkage approach. conds <- c("A", "B") resJS <- nbTestSH( countsTable, conds, "A", "B")
A subset of simulated data. It is used as an example for running some functions in this package.
data(countsTable)
data(countsTable)
The format is: num [1:10000, 1:2] 90 155 13347 254 228 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:10000] "1_FALSE" "2_FALSE" "3_TRUE" "4_FALSE" ... ..$ : chr [1:2] "A1" "B1"
A simulation counts table.
data(countsTable) head(countsTable)
data(countsTable) head(countsTable)
Based on the count table and the p-values, this function can be used to draw a MA plot of the log2 ratios versus the log2 averages upon means of gene expression in condition A and B, and a volcano plot of negative log2 p-values versus the log2 ratios.
drawMA_vol(y, groups2, pv, cutoff=NULL, xlab1="(log2(A)+log2(B))/2", ylab1="log2(A)-log2(B)", tt1="MA plot", tt2="volcano plot", log2FoldChange =NULL, col1=c("black","red"))
drawMA_vol(y, groups2, pv, cutoff=NULL, xlab1="(log2(A)+log2(B))/2", ylab1="log2(A)-log2(B)", tt1="MA plot", tt2="volcano plot", log2FoldChange =NULL, col1=c("black","red"))
y |
A count table in which row represents genes and column represents samples. |
groups2 |
A vector indicates the two groups information of samples. It must match to the column in the count table, which is the input for y. For example, groups2=c("A","A","B","B") when the first two columns in the count table are the two samples from condition A, and the second two columns in the count table are the two samples from condition B. |
pv |
A vector of per-gene p-values based on the count table. The order of genes in pv does matter. It must be the same as the order of genes in the count table. |
cutoff |
A value used as a threshold for per-gene p-values to decide the genes that are differentially expressed between two conditions. If NULL, the cutoff value is calculated so that the red dots in the MA plot and volcano plot represent the first 5 |
xlab1 |
A character indicating the label of x axis in MA plot. |
ylab1 |
A character indicating the label of y axis in MA plot. |
tt1 |
A character indicating the title of the MA plot. |
tt2 |
A character indicating the title of the volcano plot. |
log2FoldChange |
A vector of fold changes in log2 scale. It will be calculated automatically when “log2FoldChange=NULL". |
col1 |
A vector with two values including the colors of points. The first color in “col1" is the color for the points that are non-differentially changed. The second value in “col1" is the color for the points that are differentially changed. The default is c(“black", “red"). |
x <- matrix(rnorm(4000, 10), ncol=4) px <- apply(x, 1, function(y){t.test(y[1:2], y[3:4])$p.value}) drawMA_vol(x, c("A","A","B","B"), px, cutoff=0.05)
x <- matrix(rnorm(4000, 10), ncol=4) px <- apply(x, 1, function(y){t.test(y[1:2], y[3:4])$p.value}) drawMA_vol(x, c("A","A","B","B"), px, cutoff=0.05)
This function is used to draw Empirical CDF plot. It relies on the trapz function in the caTools package. A user needs to install the caTool library first.
ecdfAUC(dd, col.line=NULL, main="ECDF", cex.leg=1, drawRef=FALSE, rm1=FALSE, lineType=NULL, addLeg=TRUE, xlab="p-value", ylab="ECDF", cex.axis=1.5, cex.main=1.8, cex.lab=1.2, axis.padj=c(-1, 1), lab.padj=c(-1.5, 1), lwd=1, box.lwd=1.2)
ecdfAUC(dd, col.line=NULL, main="ECDF", cex.leg=1, drawRef=FALSE, rm1=FALSE, lineType=NULL, addLeg=TRUE, xlab="p-value", ylab="ECDF", cex.axis=1.5, cex.main=1.8, cex.lab=1.2, axis.padj=c(-1, 1), lab.padj=c(-1.5, 1), lwd=1, box.lwd=1.2)
dd |
A data frame of p-values in which a column represents the p-values or posterior probabilities resulted by a method. |
col.line |
A vector of color characters. The default is NULL and this function automatically assigns the color for each cover shown in the ECDF plot. |
main |
The title of the plot. |
cex.leg |
An integer specifying size of the legend in the the plot. |
drawRef |
If TRUE, then a gray 45 degree line will be added in the plot. |
rm1 |
If users believe that the p-values equal to 1 belong to the different group of the others, and want to exclude them from the calculation of empirical CDF, then use rm1=TRUE. |
lineType |
A vector of integers indicating the type of lines used for the methods. |
addLeg |
If “TRUE" then a legend box with legend is added to the figure. |
xlab |
Label of x axis. |
ylab |
Label of y axis. |
cex.axis |
The size of labels on the axes. |
cex.main |
A characteristic string indicating the size of the main. |
cex.lab |
The size for the labels on x and y. |
axis.padj |
The perpendicular adjustment of ticks. |
lab.padj |
The perpendicular adjustment of labels for an axix. |
lwd |
The width of the line shown in a figure. |
box.lwd |
The width of the box line in a figure. |
x<-data.frame(A=runif(100), B=rbeta(100, 0.5, 1.2)) ecdfAUC(x);
x<-data.frame(A=runif(100), B=rbeta(100, 0.5, 1.2)) ecdfAUC(x);
This is an internal function. When the local mean-dispersion dependence is present, data can be separated into groups based on the means. The windows used to partition groups have equal width upon each other. The shrinkage (SH) estimates for dispersion will be calculated within each group. For example, when range of the per-gene mean is 1 and 3000, if data will be separated into 3 groups, then group 1 includes the genes having mean values between 1 and 1000, group 2 includes the genes having mean values between 1001 and 2000, and group 3 includes the genes having mean values between 2001 and 3000. The SH estimates will be calculated within each of the 3 groups, respectively.
equalSpace(y, x, numcls=1, propForSigma=c(0, 1), shrinkTarget=NULL, shrinkQuantile=0.975, vb=TRUE)
equalSpace(y, x, numcls=1, propForSigma=c(0, 1), shrinkTarget=NULL, shrinkQuantile=0.975, vb=TRUE)
y |
A vector including the initial values that will be regularized. For example, it can be the per-gene method of moment (MM) estimates for dispersion based on the Negative Binomial distribution for the counts table. |
x |
A vector that will be used to separate data into groups. For example, it can be the per-gene averages for the counts table. |
numcls |
An integer that indicates the number of groups to be considered. The default value is 1. |
propForSigma |
A range vector between 0 and 1 that is used to select a subset of data. It helps users to make a flexible choice on the subset of data when they believe only part of data should be used to estimate the variation among per-gene dispersion. A default input propForSigma=c(0, 1) is recommended. It means that we want to use all the data to estimate the variation. |
shrinkTarget |
A value that represents the targeted point of stabilization for shrinkage estimates on dispersion. When shrinkTarget=NULL, the point of stabilization will be calculated according to the input of shrinkQuantile. If a numeric value is input for shrinkTarget, the shrinkQuantile argument will be ignored. |
shrinkQuantile |
A value between 0 and 1 that represents the target quantile point of stabilization for shrinkage estimates on dispersion. When a numeric value is not provided for shrinkTarget, the shrinkQuantile argument is used. The default value is NULL and means that the function will automatically estimate the point of stabilization based on the pattern of the average squared difference (ASD) between the initial method of moment (MM) estimates and the shrinkage (SH) estimates on dispersion. |
vb |
A logic value. When verbose=TRUE, the detail information will be printed in the console. |
This function returns a vector of shrinkage estimate on the basis of y.
Danni Yu
data("countsTable"); #calculate the row means; rM <- rowMeans(countsTable); #calculate the row variances; rV <- rowVars(countsTable); #calculate the method-of-moment estimation on dispersions; disp <- (rV - rM)/rM^2; #calculate SH estimates in 3 groups; disp3 <- equalSpace(disp, rM, 3); head(disp3);
data("countsTable"); #calculate the row means; rM <- rowMeans(countsTable); #calculate the row variances; rV <- rowVars(countsTable); #calculate the method-of-moment estimation on dispersions; disp <- (rV - rM)/rM^2; #calculate SH estimates in 3 groups; disp3 <- equalSpace(disp, rM, 3); head(disp3);
One exact test for only one gene.
exactNBtest1(kA, kB, mu, disp, sA=1, sB=1, rA=0.5, rB=0.5)
exactNBtest1(kA, kB, mu, disp, sA=1, sB=1, rA=0.5, rB=0.5)
kA |
An integer matrix under condition A. |
kB |
An integer under matrix condition B. |
mu |
The expectation. |
disp |
The dispersion. |
sA |
The size factors under condition A. |
sB |
The size factors under condition B. |
rA |
Proportion of samples that are under condition A. |
rB |
Proportion of samples that are under condition B. |
pval |
P-value. |
exactNBtest1(100, 150, 125, 1.1)
exactNBtest1(100, 150, 125, 1.1)
In this shrinkage approach, the per-gene dispersion is considered as a variable in large dimensions. For example, if sequences of 30,000 genes are read in a RNA-seq experiment, then the dispersion variable is distributed in 30,000 dimensions. Firstly method-of-moment (MM) estimates on dispersion are calculated under the Negative Binomial (NB) modeling respectively for each gene. Those initial estimates are independently obtained in each dimension. Since RNA-seq experiments typically includes small number of samples (such as 1,2,3,4), the per-gene MM estimates are not reliable due to the limitation of sample size. We believe that there is a common variation shared across genes. The shrinkage approach regularizes per-gene dispersion estimates toward the common variation and produces robust estimates. Therefore in the second step, the MM estimates are shrunk towards an estimated target that minimizes the average squared difference (ASD) between the initial estimates and the shrinkage estimates.
getAdjustDisp(obs, propForSigma=c(0.5, 1), shrinkTarget=NULL, shrinkQuantile=NULL, verbose=TRUE)
getAdjustDisp(obs, propForSigma=c(0.5, 1), shrinkTarget=NULL, shrinkQuantile=NULL, verbose=TRUE)
obs |
A vector of initial estimates that are used to obtain the shrinkage (SH) estimates. The length of this vector must equal to the number of rows in the counts table. For example, the method-of-moment (MM) estimates for dispersion based on the Negative Binomial (NB) distribution are the initial estimates. |
propForSigma |
A range of percentiles that is used to identify a subset of data. It helps users to make a flexible choice on the subset of data when calculating variance of initial estimates among per-gene dispersion. A default input propForSigma=c(0, 1) is recommended. It means that we want to use all the data to estimate the variance. |
shrinkTarget |
A value that represents the targeted point of stabilization for shrinkage estimates on dispersion. When shrinkTarget=NULL, the point of stabilization will be calculated according to the input of shrinkQuantile. If a numeric value is input for shrinkTarget, the shrinkQuantile argument will be ignored. |
shrinkQuantile |
A value between 0 and 1 that represents the target quantile point of stabilization for shrinkage estimates on dispersion. When a numeric value is not provided for shrinkTarget, the shrinkQuantile argument is used. The default value is NULL and means that the function will automatically estimate the point of stabilization based on the pattern of the average squared difference (ASD) between the initial method of moment (MM) estimates and the shrinkage (SH) estimates on dispersion. |
verbose |
A logic value. When verbose=TRUE, the detail information will be printed in the console. |
adj |
The SH estimates that shrink the input vector of obs toward the common information. |
cpm |
A data.frame that includes several summary statistics, such as the average and the variance of values in obs based on the subset controlled by the propForSigma argument. |
data("countsTable"); #calculate the row means; rM <- rowMeans(countsTable); #calculate the row variances; rV <- rowVars(countsTable); #obtain an initial estimates; disp <- (rV - rM)/rM^2; #calculate the shrinkage estimates that shrink the initial estimates toward the common information; dispSH <- getAdjustDisp(disp); head(dispSH);
data("countsTable"); #calculate the row means; rM <- rowMeans(countsTable); #calculate the row variances; rV <- rowVars(countsTable); #obtain an initial estimates; disp <- (rV - rM)/rM^2; #calculate the shrinkage estimates that shrink the initial estimates toward the common information; dispSH <- getAdjustDisp(disp); head(dispSH);
Calculate the size factor.
getNormFactor(countsTable1)
getNormFactor(countsTable1)
countsTable1 |
A data.frame or a matrix of counts in which a row represents for a gene and a column represents for a sample. There must be at least two columns in countsTable. |
Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106.
#load a simulated data that includes a count table data("countsTable"); getNormFactor(countsTable);
#load a simulated data that includes a count table data("countsTable"); getNormFactor(countsTable);
The shrinkage target is estimated.
getQ(countsTable, sizeFactors=NULL, q.vec=NULL, plotASD=FALSE, numPart=1, propForSigma=c(0, 1), verbose=TRUE, shrinkTarget=NULL, shrinkQuantile=NULL)
getQ(countsTable, sizeFactors=NULL, q.vec=NULL, plotASD=FALSE, numPart=1, propForSigma=c(0, 1), verbose=TRUE, shrinkTarget=NULL, shrinkQuantile=NULL)
countsTable |
A data.frame or a matrix of counts in which a row represents for a gene and a column represents for a sample. There must be at least two columns in countsTable. |
sizeFactors |
A vector of values around 1 which are used to normalize between samples or libraries. The length of this vector equals to the number of columns in countsTable. |
q.vec |
A vector of sequence defines the quantiles. When q.vec=NULL, this function will generate a sequence for q.vec using seq(0.05, 0.995, 0.005). |
plotASD |
A logic value. If plotASD=TRUE, then the plot of ASD versus target points will be drawn. The SH estimates are obtained by shrinking the MM estimates toward a target point. Different SH estimates are generated using different target points. The target point that helps produce a small and stable averaged squared difference (ASD) between the MM estimates and the SH estimates is the point that approximates the common information across per- gene dispersion. |
numPart |
An integer indicates the number of groups for dispersion estimation. ‘numPart=1’ is the default value. It assumes that most of the genes share one point of stabilization (POS), and calculates the SH estimates without separating data into groups. When we assumes that genes can share different targets, the grouped SH estimates on dispersion can be be utilized. In this situation, users need to provide a number indicating the number of POS. |
propForSigma |
A range vector between 0 and 1 that is used to select a subset of data. It helps users to make a flexible choice on the subset of data when they believe only part of data should be used to estimate the variation among per-gene dispersion. A default input propForSigma=c(0, 1) is recommended. It means that we want to use all the data to estimate the variation. |
verbose |
A logic value. When verbose=TRUE, the detail information will be printed in the console. |
shrinkTarget |
A value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
shrinkQuantile |
A quantile value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
target |
The estimated point for stabilization that represents the common in formation across per-gene dispersion. |
q |
A value that shows the quantile of the target value across per-gene dispersion. |
#load a simulated data that includes a count table data("countsTable") conds <- c("A", "B") getQ(countsTable, plotASD=TRUE)
#load a simulated data that includes a count table data("countsTable") conds <- c("A", "B") getQ(countsTable, plotASD=TRUE)
This function is recommended to estimate the shrinkage target.
getT(countsTable, sizeFactors = NULL, q.vec = NULL, plotASD = FALSE, numPart = 1, propForSigma = c(0, 1), verbose = TRUE, shrinkTarget = NULL, shrinkQuantile = NULL, shrinkVar = FALSE, eSlope = 0.05, disp = NULL, dispXX = NULL, normalize = FALSE, lwd1 = 4.5, cexlab1 = 1.2)
getT(countsTable, sizeFactors = NULL, q.vec = NULL, plotASD = FALSE, numPart = 1, propForSigma = c(0, 1), verbose = TRUE, shrinkTarget = NULL, shrinkQuantile = NULL, shrinkVar = FALSE, eSlope = 0.05, disp = NULL, dispXX = NULL, normalize = FALSE, lwd1 = 4.5, cexlab1 = 1.2)
countsTable |
A data.frame or a matrix of counts in which a row represents for a gene and a column represents for a sample. There must be at least two columns in countsTable. |
sizeFactors |
A vector of values around 1 which are used to normalize between samples or libraries. The length of this vector equals to the number of columns in countsTable. |
q.vec |
A vector of sequence defines the quantiles. When q.vec=NULL, this function will generate a sequence for q.vec using seq(0.05, 0.995, 0.005). |
plotASD |
A logic value. If plotASD=TRUE, then the plot of ASD versus target points will be drawn. The SH estimates are obtained by shrinking the MM estimates toward a target point. Different SH estimates are generated using different target points. The target point that helps produce a small and stable averaged squared difference (ASD) between the MM estimates and the SH estimates is the point that approximates the common information across per- gene dispersion. This target point is termed as the point of stabilization. |
numPart |
An integer indicates the number of groups for dispersion estimation. ‘numPart=1’ is the default value. It assumes that most of the genes share one point of stabilization (POS), and calculates the SH estimates without separating data into groups. When we assumes that genes can share different points of stabilization, the grouped SH estimates on dispersion can be be utilized. In this situation, users need to provide a number indicating the number of POS. |
propForSigma |
A range vector between 0 and 1 that is used to select a subset of data. It helps users to make a flexible choice on the subset of data when they believe only part of data should be used to estimate the variation among per-gene dispersion. A default input propForSigma=c(0, 1) is recommended. It means that we want to use all the data to estimate the variation. |
verbose |
A logic value. When verbose=TRUE, the detail information will be printed in the console. |
shrinkTarget |
A value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
shrinkQuantile |
A quantile value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
shrinkVar |
A logic value. When “shrinkVariance=TRUE", the testing is based on the shrinkage estimates for variance instead of dispersion. |
eSlope |
A positive value near to zero. When selecting the shrinkage target that is small and minimizing the average squared difference (ASD), the value of “elope" is a threshold to stop the selection steps if the absolute value of a local slope for the ASD is less than the threshold. The default value is 0.05. |
disp |
A vector of initial estimates of dispersions. The length of this vector equals to the number of rows in countsTable. |
dispXX |
A vector of normalized mean expression. The length of this vector equals to the number of rows in countsTable. |
normalize |
A logic value. When estimating the shrinkage target based on the average squared difference (ASD) between the shrinkage estimates and the initial estimates, the initial estimates and ASD are normalized when “normalize=TRUE". |
lwd1 |
A value specifying the width of the curve shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 4.5. |
cexlab1 |
A value specifying the size of label text shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 1.2. |
target |
The estimated point for stabilization that represents the common in formation across per-gene dispersion. |
q |
A value that shows the quantile of the target value across per- gene dispersion. |
#load a simulated data that includes a count table data("countsTable") conds <- c("A", "B") getT(countsTable, plotASD=TRUE)
#load a simulated data that includes a count table data("countsTable") conds <- c("A", "B") getT(countsTable, plotASD=TRUE)
Internal function where there are multiple shrinkage targets.
getTgroup
getTgroup
data("countsTable") conds <- c("A", "B") resSH <- nbTestSH( countsTable, conds, "A", "B", numPart=10)
data("countsTable") conds <- c("A", "B") resSH <- nbTestSH( countsTable, conds, "A", "B", numPart=10)
A subset of the real experiment Hammer et al. It is used as an example for running some functions in this package.
data(Hammer2months)
data(Hammer2months)
A data.frame containing 4 columns and 29516 rows.
It compares gene expression in rat strains Sprague Dawley and L5 SNL Sprague Dawley 2 at the end of two months in a factorial design. Two distinct biological libraries per condition were quantified using the Illumina platform.
http://bowtie-bio.sourceforge.net/recount/
Hammer, P. et al. (2010). mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res., 20, 847-860.
Frazee, A. et al. (2011). ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics, 12, 449.
data(Hammer2months); head(countsTable);
data(Hammer2months); head(countsTable);
This is an internal function used by nbTestSH
.
It calculates the exact per-gene probabilities for p-values, and tests the
null hypothesis that the expected expression of a gene under two conditions
are not different.
nbinomTestForMatricesSH(countsA, countsB, sizeFactorsA, sizeFactorsB, numPart=1, SHonly=FALSE, propForSigma=c(0, 1), shrinkTarget=NULL, shrinkQuantile=NULL, cLA, cLB, contrast=NULL, keepLevelsConsistant=TRUE, useMMdisp=FALSE, shrinkVariance=FALSE, pairedDesign=FALSE, pairedDesign.dispMethod="per-pair", useFisher=FALSE, Dispersions=NULL, eSlope=NULL, plotASD=FALSE, lwd_ASD=4.5, cex_ASD=1.2)
nbinomTestForMatricesSH(countsA, countsB, sizeFactorsA, sizeFactorsB, numPart=1, SHonly=FALSE, propForSigma=c(0, 1), shrinkTarget=NULL, shrinkQuantile=NULL, cLA, cLB, contrast=NULL, keepLevelsConsistant=TRUE, useMMdisp=FALSE, shrinkVariance=FALSE, pairedDesign=FALSE, pairedDesign.dispMethod="per-pair", useFisher=FALSE, Dispersions=NULL, eSlope=NULL, plotASD=FALSE, lwd_ASD=4.5, cex_ASD=1.2)
countsA |
A counts table under condition “condA". |
countsB |
A counts table under condition “condB". |
sizeFactorsA |
A vector of size factors under condition “condA". |
sizeFactorsB |
A vector of size factors under condition “condB". |
numPart |
An integer indicating the number of targets for the shrinkage dispersion estimates. “numPart=1" is the default value. It assumes that all the genes share one common target, and then the method of moment estimates are shrunk toward one single target. When it is assumed that the genes share multiple targets, the value for “numPart" is the number of targets and the grouped shrinkage estimates for dispersion are calculated. |
SHonly |
If ‘SHonly’ is TRUE, then the function outputs the shrinkage estimates for dispersion without testing the differentiation between conditions. If FALSE, then the function outputs a data frame including the per-gene p-values of tests. |
propForSigma |
A range vector between 0 and 1 that is used to select a subset of data. It helps users to make a flexible choice on the subset of data when they believe only part of data should be used to estimate the variation among per-gene dispersion. An input “propForSigma=c(0.1, 0.9)" means that the genes having method of moment estimates for dispersion greater than the 10th quantile and less than the 90th quantile are used to estimate the dispersion variation. The default input “propForSigma=c(0, 1)" is recommended. It means that we want to use all the data to estimate the dispersion variation. |
shrinkTarget |
A value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
shrinkQuantile |
A quantile value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
cLA |
A data.frame indicating the levels or extra factors under condition “condA". |
cLB |
A data.frame indicating the levels or extra factors under condition “condB". |
contrast |
A contrast vector for testing in complex experiments. The length of this vector equals to the number of columns in countsTable. |
keepLevelsConsistant |
A logic TRUE/FALSE value. When “coLevels" is used to indicate a paired design experiment, “keepLevelsConsistant=TRUE" silences the genes that have different changing directions (i.e. positive and negative test statistics) among individual samples by setting their p-values as 1. |
useMMdisp |
A logic value. When “useMMdisp=TRUE" the method of moment (MM) estimates for dispersion without any shrinkage approach are used for testing the differentiation of genes between two conditions. |
shrinkVariance |
A logic value. When “shrinkVariance=TRUE", the testing is based on the shrinkage estimates for variance instead of dispersion. |
pairedDesign |
A logic value. When pairedDesign=TRUE is specified, the tests are performed
specifically for the paired design experiment. The Null hypotheses
|
pairedDesign.dispMethod |
A character specifying the method of selecting data used for the paired design experiment. When the input is “per-pair" (the default input), the dispersion estimates are shrunk within each pair of samples. The shrinkage target is different in different pair of samples. When the input is "pooled", firstly method of moment estimates for dispersion are obtained within each pair of samples, and then the average estimates across all pairs of samples are shrunk toward a common targets among genes. |
useFisher |
A logic value specifying whether Fisher's method of combining multiple
p-values for a gene is used in the paired design experiment. In detail the
formula of calculating the Fisher's combined p-values is
pval |
Dispersions |
If it is not null, then the input is a vector of known dispersion values. The length of the vector equals to the number of genes in the counts table. The default value is “NULL". |
eSlope |
A positive value near to zero. When selecting the shrinkage target that is small and minimizing the average squared difference (ASD), the value of “elope" is a threshold to stop the selection steps if the absolute value of a local slope for the ASD is less than the threshold. The default value is 0.05. |
plotASD |
A logic value. If plotASD=TRUE, then the plot of average squared difference (ASD) versus target points is produced. The shrinkage (SH) estimates are obtained by shrinking the method of moment (MM) estimates toward a target. In the figure, the vertical axis are ASD values when the shrinkage target (represented by the horizontal axis) varies within the range of dispersion estimates. The selected target is a small value minimizing ASD. |
lwd_ASD |
A value specifying the width of the curve shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 4.5. |
cex_ASD |
A value specifying the size of label text shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 1.2. |
This is the main function calculating the exact per-gene probabilities for p-values. It tests the null hypothesis that the expected expression of a gene under two conditions are the same.
nbTestSH(countsTable, conds, condA = "A", condB = "B", numPart = 1, SHonly = FALSE, propForSigma = c(0, 1), shrinkTarget = NULL, shrinkQuantile = NULL, plotASD = FALSE, coLevels = NULL, contrast = NULL, keepLevelsConsistant = FALSE, useMMdisp = FALSE, addRawData = FALSE, shrinkVariance = FALSE, pairedDesign = FALSE, pairedDesign.dispMethod = "per-pair", useFisher = FALSE, Dispersions = NULL, eSlope = 0.05, lwd_ASD = 4.5, cex_ASD = 1.2)
nbTestSH(countsTable, conds, condA = "A", condB = "B", numPart = 1, SHonly = FALSE, propForSigma = c(0, 1), shrinkTarget = NULL, shrinkQuantile = NULL, plotASD = FALSE, coLevels = NULL, contrast = NULL, keepLevelsConsistant = FALSE, useMMdisp = FALSE, addRawData = FALSE, shrinkVariance = FALSE, pairedDesign = FALSE, pairedDesign.dispMethod = "per-pair", useFisher = FALSE, Dispersions = NULL, eSlope = 0.05, lwd_ASD = 4.5, cex_ASD = 1.2)
countsTable |
A data.frame or a matrix of counts in which a row represents for a gene and a column represents for a sample. There must be at least two columns in countsTable. |
conds |
A vector of characters representing the two conditions (or two groups). It must be matchable to the columns in countsTable. For example, c("A", "A", "B", "B") matches to a countsTable that has four columns (or samples) in which the first two columns are samples under condition A and the last two columns are samples under condition B. |
condA |
A character specifying the first condition in countsTable, e.g. condA="A". |
condB |
A character specifying the second condition in countsTable, e.g. condB="B". |
numPart |
An integer indicating the number of targets for the shrinkage dispersion estimates. “numPart=1" is the default value. It assumes that all the genes share one common target, and then the method of moment estimates are shrunk toward one single target. When it is assumed that the genes share multiple targets, the value for “numPart" is the number of targets and the grouped shrinkage estimates for dispersion are calculated. |
SHonly |
If ‘SHonly’ is TRUE, then the function outputs the shrinkage estimates for dispersion without testing the differentiation between conditions. If FALSE, then the function outputs a data frame including the per-gene p- values of tests. |
propForSigma |
A range vector between 0 and 1 that is used to select a subset of data. It helps users to make a flexible choice on the subset of data when they believe only part of data should be used to estimate the variation among per-gene dispersion. An input “propForSigma=c(0.1, 0.9)" means that the genes having method of moment estimates for dispersion greater than the 10th quantile and less than the 90th quantile are used to estimate the dispersion variation. The default input “propForSigma=c(0, 1)" is recommended. It means that we want to use all the data to estimate the dispersion variation. |
shrinkTarget |
A value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
shrinkQuantile |
A quantile value for the shrinkage target of dispersion estimates. If “shrinkTarget=NULL" and “shrinkQuantile" is a value instead of NULL, then the quantile value for “shrinkQuantile" is converted into the scale of dispersion estimates and used as the target. If both of them are NULL, then a value that is small and minimizes the average squared difference is automatically used as the target value. If both of them are not NULL, then the value of “shrinkTarget" is used as the target. |
plotASD |
A logic value. If plotASD=TRUE, then the plot of average squared difference (ASD) versus target points is produced. The shrinkage (SH) estimates are obtained by shrinking the method of moment (MM) estimates toward a point target. In the figure, the vertical axis are ASD values when the shrinkage target (represented by the horizontal axis) varies within the range of dispersion estimates. The selected target is a small value minimizing ASD. |
coLevels |
A data.frame specifying the additional factors for testing in complex experiments. The number of row in “coLevels" matches the number of columns in countsTable. It describes the extra features or factors other than the two basic conditions. For example, “conds=c("A","A","B","B")" and “coLevels=data.frame(sample=c(1,2,1,2))" indicate a paired design experiment. Column 1 and 3 in countsTable are a paired observations for sample 1 in two different conditions. |
contrast |
A contrast vector for testing in complex experiments. The length of this vector equals to the number of columns in countsTable. |
keepLevelsConsistant |
A logic TRUE/FALSE value. When “coLevels" is used to indicate a paired design experiment, “keepLevelsConsistant=TRUE" silences the genes that have different changing directions (i.e. positive and negative test statistics) among individual samples by setting their p-values as 1. |
useMMdisp |
A logic value. When “useMMdisp=TRUE" the method of moment (MM) estimates for dispersion without any shrinkage approach are used for testing the differentiation of genes between two conditions. |
addRawData |
A logic value. When “addRawData=TRUE", this function also outputs the original values of countsTable. |
shrinkVariance |
A logic value. When “shrinkVariance=TRUE", the testing is based on the shrinkage estimates for variance instead of dispersion. |
pairedDesign |
A logic value. When pairedDesign=TRUE is specified, the tests are performed
specifically for the paired design experiment. The Null hypotheses
|
pairedDesign.dispMethod |
A character specifying the method of selecting data used for the paired design experiment. When the input is “per-pair" (the default input), the dispersion estimates are shrunk within each pair of samples. The shrinkage target is different in different pair of samples. When the input is "pooled", firstly method of moment estimates for dispersion are obtained within each pair of samples, and then the average estimates across all pairs of samples are shrunk toward a common targets among genes. |
useFisher |
A logic value specifying whether Fisher's method of combining multiple p-
values for a gene is used in the paired design experiment. In detail the
formula of calculating the Fisher's combined p-values is
pval |
Dispersions |
If it is not null, then the input is a vector of known dispersion values. The length of the vector equals to the number of genes in the counts table. The default value is “NULL". |
eSlope |
A positive value near to zero. When selecting the shrinkage target that is small and minimizing the average squared difference (ASD), the value of “elope" is a threshold to stop the selection steps if the absolute value of a local slope for the ASD is less than the threshold. The default value is 0.05. |
lwd_ASD |
A value specifying the width of the curve shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 4.5. |
cex_ASD |
A value specifying the size of label text shown in the plot for the average squared difference when “plotASD=TRUE". The default value is 1.2. |
Mean |
The row per-gene averages over the values in countsTable. |
log2FoldChange |
The per-gene fold Changes between condition A and B in the log2 scale. |
dispMM |
The per-gene method of moment (MM) estimates on dispersion. |
dispSH |
The per-gene shrinkage (SH) estimates on dispersion. |
pval |
The per-gene p-values based on the exact tests. Smaller p-value indicates a higher chance of rejecting the null hypothesis that the expected gene expression distributes identically between the two conditions. |
Yu, D., Huber, W. and Vitek O. (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics.
#load a simulated data that includes a count table data("countsTable") #Differential analysis in sSeq. conds <- c("A", "B") resSH <- nbTestSH( countsTable, conds, "A", "B") #If users only want to calculate the SH dispersion estimates and #draw a mean-dispersion plot, the following scripts can be used. library('RColorBrewer') dispSH <- nbTestSH( countsTable, conds, "A", "B", SHonly=TRUE) plotDispersion(dispSH)
#load a simulated data that includes a count table data("countsTable") #Differential analysis in sSeq. conds <- c("A", "B") resSH <- nbTestSH( countsTable, conds, "A", "B") #If users only want to calculate the SH dispersion estimates and #draw a mean-dispersion plot, the following scripts can be used. library('RColorBrewer') dispSH <- nbTestSH( countsTable, conds, "A", "B", SHonly=TRUE) plotDispersion(dispSH)
This function is used to draw a scatter plot of dispersion versus mean of count table. It helps to visually inspect the dependence between the dispersion estimates and the mean estimates.
plotDispersion(DispSH, extraOutput=NULL, plotMethod="logDisp", ylim1=NULL, legPos="topleft", myCol=brewer.pal(9, "Set1"), tt=NULL)
plotDispersion(DispSH, extraOutput=NULL, plotMethod="logDisp", ylim1=NULL, legPos="topleft", myCol=brewer.pal(9, "Set1"), tt=NULL)
DispSH |
A data frame includes ‘SH’, ‘raw’, and ‘mus’. They are the shrinkage estimates of dispersion, the method of moment estimates of dispersion, and the estimates of mean. This data frame is obtained using the function ‘nbTestSH’ and specifying ‘SHonly=TRUE’. |
extraOutput |
A data.frame including dispersion estimates and expectation estimates using another method. When users want to compare the dispersion estimates using two different method, this argument can be used to include the result from the second method. The default value is NULL. This means that no extra method is compared. |
plotMethod |
If plotMethod=“logDisp" which is the default, then both dispersion and mean estimates are shown in the log scale. If plotMethod=“Disp", then only mean estimates are shown in the log scale. |
ylim1 |
A vector of two values that specifies the minimum and maximum values of the vertical y axis in the plot. It is used to limit the presenting range of y axis in the plot. If ylime1=NULL then the range of the shrinkage estimates of dispersion is used. |
legPos |
A character indicating the position of legend in the plot. The value of this argument can be "topleft", "topright", "bottomleft" and "bottomright". |
myCol |
A vector of colors corresponding to the dispersions estimated using different methods. |
tt |
A character representing the title of the plot, which is shown on the top in the plot. |
data("countsTable") conds <- c("A", "B") dispSH <- nbTestSH( countsTable, conds, "A", "B", SHonly=TRUE) library('RColorBrewer') plotDispersion(dispSH, legPos="topleft")
data("countsTable") conds <- c("A", "B") dispSH <- nbTestSH( countsTable, conds, "A", "B", SHonly=TRUE) library('RColorBrewer') plotDispersion(dispSH, legPos="topleft")
This function is based on the re-parameterized Negative Binomial distribution to generate random observations.
rnbinomMV(n, mu, v)
rnbinomMV(n, mu, v)
n |
The number of values that will be randomly generated. |
mu |
The expectation of the Negative Binomial distribution. |
v |
The variance of the Negative Binomial distribution. |
x <- rnbinomMV(50, 10, 15) hist(x)
x <- rnbinomMV(50, 10, 15) hist(x)
This function helps to obtain row-wise estimation across columns.
rowVars(x)
rowVars(x)
x |
A matrix or data.frame that includes multiple columns. |
A vector showing the per-row variance estimates for the matrix or data.frame.
x <- matrix(rnorm(10), 5) rowVars(x)
x <- matrix(rnorm(10), 5) rowVars(x)
This function is used to approximate the real experiment and to generate simulated counts table based on Negative Binomial distribution.
sim(ngenes, true_mean1, conds, alpha = function(m) {rep(0.1, length(m))}, mean_DE = 0, sd_DE = 2, s0 = NULL, s0_mean = 2, s0_sd = 3, true_isDE_proportion = 0.3)
sim(ngenes, true_mean1, conds, alpha = function(m) {rep(0.1, length(m))}, mean_DE = 0, sd_DE = 2, s0 = NULL, s0_mean = 2, s0_sd = 3, true_isDE_proportion = 0.3)
ngenes |
The total number of genes or rows in the simulated counts table. |
true_mean1 |
The expected gene expression ( |
conds |
A vector of characters representing the two conditions (or two groups). It must be matchable to the columns in countsTable, e.g., c("A", "A", "B", "B") matches to a countsTable that has four columns (or samples) in which the first two columns are samples under condition A and the last two columns are samples under condition B. |
alpha |
A function used to generate the true dispersion values. The default function generates a constant 0.1 for all the genes. It can also be a function specifying the dependence between dispersion and mean. |
mean_DE |
A true mean value of |
sd_DE |
A true standard deviation of |
s0 |
The true size factors for samples. The length of this vector equals to the length of the vector ‘conds’. |
s0_mean |
If the true size factors for samples are not defined for ‘s0’, then the true size factors are assumed to follow a Normal distribution with mean as the value for ‘s0_mean’. |
s0_sd |
If the true size factors for samples are not defined for ‘s0’, then the true size factors are assumed to follow a Normal distribution with standard deviation as the value for ‘s0_sd’. |
true_isDE_proportion |
The proportion of genes that are truly different. The default value is 0.3. |
The function outputs a list including the simulated counts table, a vector with TRUE of FALSE values indicating the truly differentiating genes, the true mean values, the true variance values, and the true dispersion values.
We acknowledge Dr. Simon Anders since he provided the details for simulation in the manual of DESeq package.
Yu, D., Huber, W. and Vitek O. (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics.
ng = 10000; sim1 <- sim(ngenes=ng, conds=c("A","A","B","B"), true_mean1=round(rexp(ng, rate=1/200)), alpha=function(m){1/(m+100)}, mean_DE=2, sd_DE=1, s0=runif(4, 0, 2) ); true_isDE <- sim1$true_isDE; countsTable <- sim1$countsTable;
ng = 10000; sim1 <- sim(ngenes=ng, conds=c("A","A","B","B"), true_mean1=round(rexp(ng, rate=1/200)), alpha=function(m){1/(m+100)}, mean_DE=2, sd_DE=1, s0=runif(4, 0, 2) ); true_isDE <- sim1$true_isDE; countsTable <- sim1$countsTable;
A subset of the real experiment Sultan et al. It is used as an example for running some functions in this package.
data(Sultan)
data(Sultan)
A data.frame containing 4 columns and 52580 rows.
It compares two biological replicates of human cell lines Ramos B and HEK293T with the Illumina platform.
http://bowtie-bio.sourceforge.net/recount/
Sultan, M. et al. (2008). A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321, 956.
Frazee, A. et al. (2011). ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics, 12, 449.
data(Sultan); head(countsTable);
data(Sultan); head(countsTable);
A subset of the real experiment Tuch et al. It is used as an example for running some functions in this package.
data(Tuch)
data(Tuch)
A data.frame containing 6 columns and 10453 rows.
It compares the expression of genes in normal human tissues and in tissues with oral squamous cell carcinoma. The experiment had a paired design in that pairs of normal and tumor samples were obtained from three patient. The six libraries were sequenced using the SOLiD platform.
The table of read counts was downloaded from GEO (accession GSE20116).
Tuch, B. et al. (2010). Tumor transcriptome sequencing reveals allelic expression imbalances associated with copy number alterations. PloS One, 5, e9317.
data(Tuch); head(countsTable);
data(Tuch); head(countsTable);