Title: | Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions |
---|---|
Description: | Gene selection based on a mixture of marginal distributions. |
Authors: | Jarrett Morrow <[email protected]>, Weiliang Qiu <[email protected]>, Wenqing He <[email protected]>, Xiaogang Wang <[email protected]>, Ross Lazarus <[email protected]>. |
Maintainer: | Weiliang Qiu <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.51.0 |
Built: | 2024-11-29 05:54:30 UTC |
Source: | https://github.com/bioc/GeneSelectMMD |
Calculating FDR, FNDR, FPR, and FNR for a real microarray data set based on the mixture of marginal distributions.
errRates(obj.gsMMD)
errRates(obj.gsMMD)
obj.gsMMD |
an object returned by |
We first fit the real microarray data set by the mixture of marginal distributions. Then we calculate the error rates based on the posterior distributions of a gene belonging to a gene cluster given its gene profiles. Please refer to Formula (7) on the page 6 of the paper listed in the Reference section.
A vector of 4 elements:
FDR |
the percentage of nondifferentially expressed genes among selected genes. |
FNDR |
the percentage of differentially expressed genes among unselected genes. |
FPR |
the percentage of selected genes among nondifferentially expressed genes |
FNR |
the percentage of un-selected genes among differentially expressed genes |
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(errRates(obj.gsMMD), 3) ## End(Not run)
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(errRates(obj.gsMMD), 3) ## End(Not run)
Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions.
Input is an object derived from the class ExpressionSet
. The function will obtain initial gene cluster membership by its own.
gsMMD(obj.eSet, memSubjects, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, iniGeneMethod = "Ttest", transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
gsMMD(obj.eSet, memSubjects, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, iniGeneMethod = "Ttest", transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
obj.eSet |
an object derived from the class |
memSubjects |
a vector of membership of subjects. |
maxFlag |
logical. Indicate how to assign gene class membership. |
thrshPostProb |
threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1 given its gene expression levels is larger than |
geneNames |
an optional character vector of gene names |
alpha |
significant level which is equal to |
iniGeneMethod |
method to get initial 3-cluster partition of genes. Available methods are: “Ttest”, “Wilcox”. |
transformFlag |
logical. Indicate if data transformation is needed |
transformMethod |
method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none". |
scaleFlag |
logical. Indicate if gene profiles are to be scaled to have mean zero and variance one. If |
criterion |
if |
minL |
lower limit for the |
maxL |
upper limit for the |
stepL |
tolerance when searching the optimal |
eps |
a small positive value. If the absolute value of a value is smaller than |
ITMAX |
maximum iteration allowed for iterations in the EM algorithm |
plotFlag |
logical. Indicate if the Box-Cox normality plot should be output. |
quiet |
logical. Indicate if intermediate results should be printed out. |
We assume that the distribution of gene expression profiles is
a mixture of 3-component multivariate normal distributions
. Each component distribution
corresponds to a gene cluster. The 3 components correspond to 3 gene clusters:
(1) up-regulated gene cluster, (2) non-differentially expressed gene cluster,
and (3) down-regulated gene cluster.
The model parameter vector is
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
.
where
,
, and
are the mixing proportions;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for non-diseased subjects;
,
, and
are the marginal mean,
variance, and correlation of gene
expression levels of cluster 2 (non-differentially expressed genes);
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for non-diseased subjects.
Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.
To make sure the identifiability, we set the following contraints:
and
.
To make sure the marginal covariance matrices are poisitive definite,
we set the following contraints:
,
,
,
,
.
We also has the following constraints for the mixing proportion:
,
,
.
We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.
To facilitate the estimation of the parameters,
we reparametrize the parameter vector as
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
where
,
,
,
,
,
,
.
Given a gene, the expression levels of the gene are assumed independent. However, after scaling, the scaled expression levels of the gene are no longer independent and the rank of the covariance matrix for the scaled gene profile will be one less than the rank
for the un-scaled gene profile Hence the covariance matrix of the
gene profile will no longer be positive-definite.
To avoid this problem,
we delete a tissue sample after scaling since its information has been incorrporated by other scaled tissue samples. We arbitrarily select the tissue sample, which has the biggest label number, from the tissue sample group that has larger size than the other tissue sample group. For example, if there are 6 cancer tissue samples and 10 normal tissue samples, we delete the 10-th normal tissue sample after scaling.
A list contains 18 elements.
dat |
the (transformed) microarray data matrix. If tranformation
performed, then |
memSubjects |
the same as the input |
memGenes |
a vector of cluster membership of genes. |
memGenes2 |
an variant of the vector of cluster membership of genes.
|
para |
parameter estimates (c.f. details). |
llkh |
value of the loglikelihood function. |
wiMat |
posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i. |
wiArray |
posterior probability matrix for different initial gene selection methods. |
memIniMat |
a matrix of initial cluster membership of genes. |
paraIniMat |
a matrix of parameter estimates based on initial gene cluster membership. |
llkhIniVec |
a vector of values of loglikelihood function. |
memMat |
a matrix of cluster membership of genes based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
paraMat |
a matrix of parameter estimates based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
llkhVec |
a vector of values of loglikelihood function based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
lambda |
the parameter used to do Box-Cox transformation |
paraRP |
parameter estimates for reparametrized parameter vector (c.f. details). |
paraIniMatRP |
a matrix of parameter estimates for reparametrized parameter vector based on initial gene cluster membership. |
paraMatRP |
a matrix of parameter estimates for reparametrized parameter vector based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
The speed of the program can be slow for large data sets, however it has been improved using Fortran code.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
gsMMD.default
,
gsMMD2
,
gsMMD2.default
library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(obj.gsMMD$para, 3)
library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(obj.gsMMD$para, 3)
Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions. Input is a data matrix. The function will obtain initial gene cluster membership by its own.
gsMMD.default(X, memSubjects, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, iniGeneMethod = "Ttest", transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
gsMMD.default(X, memSubjects, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, iniGeneMethod = "Ttest", transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
X |
a data matrix. The rows of the matrix are genes. The columns of the matrix are subjects. |
memSubjects |
a vector of membership of subjects. |
maxFlag |
logical. Indicate how to assign gene class membership. |
thrshPostProb |
threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1 given its gene expression levels is larger than |
geneNames |
an optional character vector of gene names |
alpha |
significant level which is equal to |
iniGeneMethod |
method to get initial 3-cluster partition of genes. Available methods are: “Ttest”, “Wilcox”. |
transformFlag |
logical. Indicate if data transformation is needed |
transformMethod |
method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none". |
scaleFlag |
logical. Indicate if gene profiles are to be scaled. If |
criterion |
if |
minL |
lower limit for the |
maxL |
upper limit for the |
stepL |
tolerance when searching the optimal |
eps |
a small positive value. If the absolute value of a value is smaller than |
ITMAX |
maximum iteration allowed for iterations in the EM algorithm |
plotFlag |
logical. Indicate if the Box-Cox normality plot should be output. |
quiet |
logical. Indicate if intermediate results should be printed out. |
We assume that the distribution of gene expression profiles is
a mixture of 3-component multivariate normal distributions
. Each component distribution
corresponds to a gene cluster. The 3 components correspond to 3 gene clusters:
(1) up-regulated gene cluster, (2) non-differentially expressed gene cluster,
and (3) down-regulated gene cluster.
The model parameter vector is
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
.
where
,
, and
are the mixing proportions;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for non-diseased subjects;
,
, and
are the marginal mean,
variance, and correlation of gene
expression levels of cluster 2 (non-differentially expressed genes);
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for non-diseased subjects.
Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.
To make sure the identifiability, we set the following contraints:
and
.
To make sure the marginal covariance matrices are poisitive definite,
we set the following contraints:
,
,
,
,
.
We also has the following constraints for the mixing proportion:
,
,
.
We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.
To facilitate the estimation of the parameters,
we reparametrize the parameter vector as
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
where
,
,
,
,
,
,
.
Given a gene, the expression levels of the gene are assumed independent. However, after scaling, the scaled expression levels of the gene are no longer independent and the rank of the covariance matrix for the scaled gene profile will be one less than the rank
for the un-scaled gene profile Hence the covariance matrix of the
gene profile will no longer be positive-definite.
To avoid this problem,
we delete a tissue sample after scaling since its information has been incorrporated by other scaled tissue samples. We arbitrarily select the tissue sample, which has the biggest label number, from the tissue sample group that has larger size than the other tissue sample group. For example, if there are 6 cancer tissue samples and 10 normal tissue samples, we delete the 10-th normal tissue sample after scaling.
A list contains 18 elements.
dat |
the (transformed) microarray data matrix. If tranformation
performed, then |
memSubjects |
the same as the input |
memGenes |
a vector of cluster membership of genes. |
memGenes2 |
an variant of the vector of cluster membership of genes.
|
para |
parameter estimates (c.f. details). |
llkh |
value of the loglikelihood function. |
wiMat |
posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i. |
wiArray |
posterior probability matrix for different initial gene selection methods. |
memIniMat |
a matrix of initial cluster membership of genes. |
paraIniMat |
a matrix of parameter estimates based on initial gene cluster membership. |
llkhIniVec |
a vector of values of loglikelihood function. |
memMat |
a matrix of cluster membership of genes based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
paraMat |
a matrix of parameter estimates based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
llkhVec |
a vector of values of loglikelihood function based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
lambda |
the parameter used to do Box-Cox transformation |
paraRP |
parameter estimates for reparametrized parameter vector (c.f. details). |
paraIniMatRP |
a matrix of parameter estimates for reparametrized parameter vector based on initial gene cluster membership. |
paraMatRP |
a matrix of parameter estimates for reparametrized parameter vector based on the mixture of marginal models with initial parameter estimates obtained initial gene cluster membership. |
The speed of the program can be slow for large data sets, however it has been improved using Fortran code.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD.default(mat, memSubjects, iniGeneMethod = "Ttest", transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE) round(obj.gsMMD$para, 3) ## End(Not run)
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD.default(mat, memSubjects, iniGeneMethod = "Ttest", transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE) round(obj.gsMMD$para, 3) ## End(Not run)
Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions.
Input is an object derived from the class ExpressionSet
. The user needs to provide initial gene cluster membership.
gsMMD2(obj.eSet, memSubjects, memIni, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
gsMMD2(obj.eSet, memSubjects, memIni, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
obj.eSet |
an object derived from the class |
memSubjects |
a vector of membership of subjects. |
memIni |
a vector of user-provided gene cluster membership. |
maxFlag |
logical. Indicate how to assign gene class membership. |
thrshPostProb |
threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1 given its gene expression levels is larger than |
geneNames |
an optional character vector of gene names |
alpha |
significant level which is equal to |
transformFlag |
logical. Indicate if data transformation is needed |
transformMethod |
method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none". |
scaleFlag |
logical. Indicate if gene profiles are to be scaled. If |
criterion |
if |
minL |
lower limit for the |
maxL |
upper limit for the |
stepL |
tolerance when searching the optimal |
eps |
a small positive value. If the absolute value of a value is smaller than |
ITMAX |
maximum iteration allowed for iterations in the EM algorithm |
plotFlag |
logical. Indicate if the Box-Cox normality plot should be output. |
quiet |
logical. Indicate if intermediate results should be printed out. |
We assume that the distribution of gene expression profiles is
a mixture of 3-component multivariate normal distributions
. Each component distribution
corresponds to a gene cluster. The 3 components correspond to 3 gene clusters:
(1) up-regulated gene cluster, (2) non-differentially expressed gene cluster,
and (3) down-regulated gene cluster.
The model parameter vector is
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
.
where
,
, and
are the mixing proportions;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for non-diseased subjects;
,
, and
are the marginal mean,
variance, and correlation of gene
expression levels of cluster 2 (non-differentially expressed genes);
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for non-diseased subjects.
Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.
To make sure the identifiability, we set the following contraints:
and
.
To make sure the marginal covariance matrices are poisitive definite,
we set the following contraints:
,
,
,
,
.
We also has the following constraints for the mixing proportion:
,
,
.
We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.
To facilitate the estimation of the parameters,
we reparametrize the parameter vector as
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
where
,
,
,
,
,
,
.
Given a gene, the expression levels of the gene are assumed independent. However, after scaling, the scaled expression levels of the gene are no longer independent and the rank of the covariance matrix for the scaled gene profile will be one less than the rank
for the un-scaled gene profile Hence the covariance matrix of the
gene profile will no longer be positive-definite.
To avoid this problem,
we delete a tissue sample after scaling since its information has been incorrporated by other scaled tissue samples. We arbitrarily select the tissue sample, which has the biggest label number, from the tissue sample group that has larger size than the other tissue sample group. For example, if there are 6 cancer tissue samples and 10 normal tissue samples, we delete the 10-th normal tissue sample after scaling.
A list contains 13 elements.
dat |
the (transformed) microarray data matrix. If tranformation
performed, then |
memSubjects |
the same as the input |
memGenes |
a vector of cluster membership of genes. |
memGenes2 |
an variant of the vector of cluster membership of genes.
|
para |
parameter estimates (c.f. details). |
llkh |
value of the loglikelihood function. |
wiMat |
posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i. |
memIni |
the initial cluster membership of genes. |
paraIni |
the parameter estimates based on initial gene cluster membership. |
llkhIni |
the value of loglikelihood function. |
lambda |
the parameter used to do Box-Cox transformation |
paraRP |
parameter estimates for reparametrized parameter vector (c.f. details). |
paraIniRP |
the parameter estimates for reparametrized parameter vector based on initial gene cluster membership. |
The speed of the program can be slow for large data sets, however it has been improved using Fortran code.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
gsMMD
,
gsMMD.default
,
gsMMD2.default
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } mat <- exprs(eSet1) tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3 cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n"); obj.gsMMD <- gsMMD2(eSet1, memSubjects, memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(obj.gsMMD$para, 3) ## End(Not run)
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } mat <- exprs(eSet1) tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3 cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n"); obj.gsMMD <- gsMMD2(eSet1, memSubjects, memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) round(obj.gsMMD$para, 3) ## End(Not run)
Gene selection based on the marginal distributions of gene profiles that characterized by a mixture of three-component multivariate distributions. Input is a data matrix. The user needs to provide initial gene cluster membership.
gsMMD2.default(X, memSubjects, memIni, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
gsMMD2.default(X, memSubjects, memIni, maxFlag = TRUE, thrshPostProb = 0.5, geneNames = NULL, alpha = 0.05, transformFlag = FALSE, transformMethod = "boxcox", scaleFlag = TRUE, criterion = c("cor", "skewness", "kurtosis"), minL = -10, maxL = 10, stepL = 0.1, eps = 0.001, ITMAX = 100, plotFlag = FALSE, quiet=TRUE)
X |
a data matrix. The rows of the matrix are genes. The columns of the matrix are subjects. |
memSubjects |
a vector of membership of subjects. |
memIni |
a vector of user-provided gene cluster membership. |
maxFlag |
logical. Indicate how to assign gene class membership. |
thrshPostProb |
threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1 given its gene expression levels is larger than |
geneNames |
an optional character vector of gene names |
alpha |
significant level which is equal to |
transformFlag |
logical. Indicate if data transformation is needed |
transformMethod |
method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none". |
scaleFlag |
logical. Indicate if gene profiles are to be scaled. If |
criterion |
if |
minL |
lower limit for the |
maxL |
upper limit for the |
stepL |
tolerance when searching the optimal |
eps |
a small positive value. If the absolute value of a value is smaller than |
ITMAX |
maximum iteration allowed for iterations in the EM algorithm |
plotFlag |
logical. Indicate if the Box-Cox normality plot should be output. |
quiet |
logical. Indicate if intermediate results should be printed out. |
We assume that the distribution of gene expression profiles is
a mixture of 3-component multivariate normal distributions
. Each component distribution
corresponds to a gene cluster. The 3 components correspond to 3 gene clusters:
(1) up-regulated gene cluster, (2) non-differentially expressed gene cluster,
and (3) down-regulated gene cluster.
The model parameter vector is
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
.
where
,
, and
are the mixing proportions;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 1 (up-regulated genes) for non-diseased subjects;
,
, and
are the marginal mean,
variance, and correlation of gene
expression levels of cluster 2 (non-differentially expressed genes);
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for diseased subjects;
,
, and
are
the marginal mean, variance, and correlation of gene expression levels
of cluster 3 (up-regulated genes) for non-diseased subjects.
Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.
To make sure the identifiability, we set the following contraints:
and
.
To make sure the marginal covariance matrices are poisitive definite,
we set the following contraints:
,
,
,
,
.
We also has the following constraints for the mixing proportion:
,
,
.
We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.
To facilitate the estimation of the parameters,
we reparametrize the parameter vector as
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
,
where
,
,
,
,
,
,
.
Given a gene, the expression levels of the gene are assumed independent. However, after scaling, the scaled expression levels of the gene are no longer independent and the rank of the covariance matrix for the scaled gene profile will be one less than the rank
for the un-scaled gene profile Hence the covariance matrix of the
gene profile will no longer be positive-definite.
To avoid this problem,
we delete a tissue sample after scaling since its information has been incorrporated by other scaled tissue samples. We arbitrarily select the tissue sample, which has the biggest label number, from the tissue sample group that has larger size than the other tissue sample group. For example, if there are 6 cancer tissue samples and 10 normal tissue samples, we delete the 10-th normal tissue sample after scaling.
A list contains 13 elements.
dat |
the (transformed) microarray data matrix. If tranformation
performed, then |
memSubjects |
the same as the input |
memGenes |
a vector of cluster membership of genes. |
memGenes2 |
an variant of the vector of cluster membership of genes.
|
para |
parameter estimates (c.f. details). |
llkh |
value of the loglikelihood function. |
wiMat |
posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i. |
memIni |
the initial cluster membership of genes. |
paraIni |
the parameter estimates based on initial gene cluster membership. |
llkhIni |
the value of loglikelihood function. |
lambda |
the parameter used to do Box-Cox transformation |
paraRP |
parameter estimates for reparametrized parameter vector (c.f. details). |
paraIniRP |
the parameter estimates for reparametrized parameter vector based on initial gene cluster membership. |
The speed of the program can be slow for large data sets, however it has been improved using Fortran code.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0, nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[,2] < 0] <- 3 cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n"); obj.gsMMD <- gsMMD2.default(mat, memSubjects, memIni = memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE) round(obj.gsMMD$para, 3) ## End(Not run)
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0, nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[,2] < 0] <- 3 cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n"); obj.gsMMD <- gsMMD2.default(mat, memSubjects, memIni = memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE) round(obj.gsMMD$para, 3) ## End(Not run)
Replace expression levels by the residuals of regression analysis in which predictor of interest is not in the regression model. The purpose of this function is to remove potential confounding factors.
obtainResi(es, fmla)
obtainResi(es, fmla)
es |
An |
fmla |
A formula object that specifies the covariates of the linear
regression model. The variable of interest should not be
included. No response variable should be specified in |
To remove confounding effects, we can replace the expression level by
the residuals
of a linear regression model with
response variable the expression level and covariates the potential
confounders. The functions lmFit
and eBayes
will
be used to obtain regression coefficients.
An ExpressionSet object with expression levels replaced by residuals of linear regression analysis.
The number of arrays of the returned ExpressionSet object might be smaller than that of the original ExpressionSet object, due to missing values in covariates.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Plot of histogram of pooled gene expression levels, composited with density estimate based on the mixture of marginal distributions. The density estimate is based on the assumption that the marginal correlations between subjects are zero.
plotHistDensity(obj.gsMMD, plotFlag = "case", plotComponent = FALSE, myxlab = "expression level", myylab = "density", mytitle = "Histogram with estimated density (case)", x.legend = NULL, y.legend = NULL, numPoints = 500, mycol = 1:4, mylty = 1:4, mylwd = rep(3,4), cex.main = 2, cex.lab = 1.5, cex.axis = 1.5, cex = 2, bty = "n")
plotHistDensity(obj.gsMMD, plotFlag = "case", plotComponent = FALSE, myxlab = "expression level", myylab = "density", mytitle = "Histogram with estimated density (case)", x.legend = NULL, y.legend = NULL, numPoints = 500, mycol = 1:4, mylty = 1:4, mylwd = rep(3,4), cex.main = 2, cex.lab = 1.5, cex.axis = 1.5, cex = 2, bty = "n")
obj.gsMMD |
an object returned by |
plotFlag |
logical. Indicate the plot will based on which type of subjects. |
plotComponent |
logical. Indicate if components of the mixture of marginal distribution will be plotted. |
myxlab |
label for x-axis |
myylab |
label for y-axis |
mytitle |
title of the plot |
x.legend |
the x-corrdiates of the legend |
y.legend |
the y-corrdiates of the legend |
numPoints |
logical. Indicate how many genes will be plots. |
mycol |
color for the density estimates (overall and components) |
mylty |
line styles for the density estimates (overall and components) |
mylwd |
line width for the density estimates (overall and components) |
cex.main |
font for main title |
cex.lab |
font for x- and y-axis labels |
cex.axis |
font for x- and y-axis |
cex |
font for texts |
bty |
the type of box to be drawn around the legend. The allowed values are '"o"' and '"n"' (the default). |
For a given type of subjects, we pool their expression levels together if the marginal correlations among subjects are zero. We then draw a histogram of the pooled expression levels. Next, we composite density estimates of gene expression levels for the overal distribution and the 3 component distributions.
A list containing coordinates of the density estimates:
x |
sorted pooled gene expression levels for cases or controls. |
x2 |
a subset of |
y |
density estimate corresponding to |
y1 |
weighted density estimate for gene cluster 1 |
y2 |
weighted density estimate for gene cluster 2 |
y3 |
weighted density estimate for gene cluster 3 |
The density estimate is obtained based on the
assumption that the marginal correlation among
subjects is zero. If the estimated marginal correlation obtained by gsMMD
is far from zero, then do not use this plot function.
Jarrett Morrow [email protected], Weiliang Qiu [email protected], Wenqing He [email protected], Xiaogang Wang [email protected], Ross Lazarus [email protected]
Qiu, W.-L., He, W., Wang, X.-G. and Lazarus, R. (2008). A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0 (control), T2 coded as 1 (case) memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) plotHistDensity(obj.gsMMD, plotFlag = "case", mytitle = "Histogram of for T2 imposed with estimated density (case)", plotComponent = TRUE, x.legend = c(0.8, 3), y.legend = c(0.3, 0.4), numPoints = 500) ## End(Not run)
## Not run: library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0 (control), T2 coded as 1 (case) memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = FALSE) plotHistDensity(obj.gsMMD, plotFlag = "case", mytitle = "Histogram of for T2 imposed with estimated density (case)", plotComponent = TRUE, x.legend = c(0.8, 3), y.legend = c(0.3, 0.4), numPoints = 500) ## End(Not run)