Title: | Testing Groups of Covariates/Features for Association with a Response Variable, with Applications to Gene Set Testing |
---|---|
Description: | The global test tests groups of covariates (or features) for association with a response variable. This package implements the test with diagnostic plots and multiple testing utilities, along with several functions to facilitate the use of this test for gene set testing of GO and KEGG terms. |
Authors: | Jelle Goeman and Jan Oosting, with contributions by Livio Finos, Aldo Solari, Dominic Edelmann |
Maintainer: | Jelle Goeman <[email protected]> |
License: | GPL (>= 2) |
Version: | 5.61.0 |
Built: | 2024-11-20 06:25:40 UTC |
Source: | https://github.com/bioc/globaltest |
Comparing the result of a global test performed on a subset to the test results of random subsets of the same size.
comparative(object, N = 1e3, z.scores = TRUE, trace)
comparative(object, N = 1e3, z.scores = TRUE, trace)
object |
A |
N |
The number of random subsets to generate. |
z.scores |
If set to |
trace |
If set to |
In a situation when many covariates out of a large set are associated with the response, it is sometimes interesting to know p-value of the tested subset compares to random subsets of the same size. The comparative
function calculates the proportion of random subsets of the covariates of the same size as the tested subset that have a better score than the tested subset. This proportion is a diagnostic tool to help interpret the test result; it should not be interpreted as a p-value.
An object of class gt.object
with an appropriate column added to the test results matrix.
Jelle Goeman: [email protected]; Jan Oosting
The comparative proportion is an enrichment type analysis. For the pros and cons of such an analysis, see
Goeman and Buhlmann (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23 (8) 980-987.
The gt
function. The gt.object
function and useful functions associated with that object.
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Some subsets of interest my.sets <- list(c("A", "B"), c("C","D"), c("D", "E")) # Comparative proportions res <- gt(Y, X, subsets = my.sets) comparative(res) comparative(res, z.scores=FALSE)
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Some subsets of interest my.sets <- list(c("A", "B"), c("C","D"), c("D", "E")) # Comparative proportions res <- gt(Y, X, subsets = my.sets) comparative(res) comparative(res, z.scores=FALSE)
Plots to visualize the result of a Global Test in terms of the contributions of the covariates and the subjects.
covariates(object, what = c("p-value", "statistic", "z-score", "weighted"), cluster = "average", alpha = 0.05, sort = TRUE, zoom = FALSE, legend = TRUE, plot = TRUE, colors, alias, help.lines = FALSE, cex.labels = 0.6, ylim, pdf, trace) features(...) subjects(object, what = c("p-value", "statistic", "z-score", "weighted"), cluster = "average", sort = TRUE, mirror = TRUE, legend = TRUE, colors, alias, help.lines = FALSE, cex.labels = 0.6, ylim, pdf)
covariates(object, what = c("p-value", "statistic", "z-score", "weighted"), cluster = "average", alpha = 0.05, sort = TRUE, zoom = FALSE, legend = TRUE, plot = TRUE, colors, alias, help.lines = FALSE, cex.labels = 0.6, ylim, pdf, trace) features(...) subjects(object, what = c("p-value", "statistic", "z-score", "weighted"), cluster = "average", sort = TRUE, mirror = TRUE, legend = TRUE, colors, alias, help.lines = FALSE, cex.labels = 0.6, ylim, pdf)
object |
A |
what |
Gives a choice between various presentations of the same plot. See below under details. |
cluster |
The type of hierarchical clustering performed for the dendrogram. Default is average linkage clustering. For other options, see |
alpha |
Parameter between 0 and 1. Sets the level of family-wise error control in the multiple testing procedure performed on the dendrogram. See below under details. |
sort |
If |
zoom |
If |
legend |
If |
plot |
If |
colors |
The colors to be used for the bars. See |
alias |
Optional alternative labels for the bars in the plots. Should be a character vector of the same length as the number of covariates or subjects, respectively. |
help.lines |
If |
cex.labels |
Magnification factor for the x-axis labels. |
ylim |
Override for the y axis limits of the barplot. |
pdf |
Optional filename ( |
trace |
If |
mirror |
If |
... |
All arguments of |
These two diagnostic plots decompose the test statistics into the contributions of the different covariates and subjects to make the influence of these covariates and subjects visible.
The covariates
plot exploits the fact that the global test statistic for a set of alternative covariates can be written as a weighted sum of the global test statistics for each single contributing covariate. By displaying these component global test results in a bar plot the covariates
plot gives insight into the subset of covariates that is most responsible for the significant test result. The plot can show the p-values
of the component tests on a reversed log scale (the default); their test statistics
, with stripes showing their mean and standard deviation under the null hypothesis; the z-scores
of these test statistics, standardized to mean zero and standard deviation one; or the weighted
test statistics, where the test statistics are multiplied by the relative weight that each covariate carries in the overall test. See the Vignette for more details.
The dendrogram of the covariates
plot is based on correlation distance if the directional
argument was set to TRUE
in the call to gt
, and uses absolute correlation distance otherwise. The coloring of the dendrogram is based on the multiple testing procedure of Meinshausen (2008): this procedure controls the family-wise error rate on all 2n-1
hypotheses associated with the subsets of covariates induced by the clustering graph. All significant subsets are colored black; non-significant ones remain grey. This coloring serves as an additional aid to find the subsets of the covariates most contributing to a significant test result.
The features
function is a synonym for covariates
, using exactly the same arguments.
The subjects
plot exploits the fact that the global test can be written as a sum of contributions of each individual. Each of these contributions is itself a test statistic for the same null hypothesis as the full global test, but one which puts a greater weight on the observed information of a specific subject. These test statistic of subject i
is significant if, for the other subjects, similarity in the alternative covariates to subject i
tends to coincide with similarity in residual response to subject i
. Like the covariates
plot, the subjects
plot can show the p-values
of these component tests on a reversed log scale (the default); their test statistics
, with stripes showing their mean and standard deviation under the null hypothesis; the z-scores
of these test statistics, standardized to mean zero and standard deviation one; or the weighted
test statistics, where the test statistics are multiplied by the relative weight that each covariate carries in the overall test. Setting mirror=FALSE
reverses the bars of subjects with a negative residual response (not applicable if p-values
are plotted). The resulting statistics
values have the additional interpretation that they are proportional to the first order estimates of the linear predictors of each subject under the alternative, i.e. subjects with positive values have higher means under the alternative than under the null, and subjects with negative values have lower means under the alternative than under the null. See the Vignette for more details.
The dendrogram of the subjects
plot is always based on correlation distance. There is no analogue to Meinshausen's multiple testing method for this dendrogram, so multiple testing is not performed.
If called to make a single plot, the covariates
function returns an object of class gt.object
. Several methods are available to access this object: see gt.object
. The subjects
function returns a matrix. If called to make multiple plots, both functions return NULL
.
The term "z-score" is not meant to imply a normal distribution, but just refers to a studentized score. The z-scores of the subjects
plot are asymptotically normal under the null hypothesis; the z-scores of the covariates
plot are asymptotically distributed as a chi-squared variable with one degree of freedom.
Jelle Goeman: [email protected]; Livio Finos.
General theory and properties of the global test are described in
Goeman, Van de Geer and Van Houwelingen (2006) Journal of the Royal Statistical Society, Series B 68 (3) 477-493.
Meinshausen's method for multiple testing
Meinshausen (2008) Biometrika 95 (2) 265-278.
For more references related to applications of the test, see the vignette GlobalTest.pdf included with this package.
Diagnostic plots: covariates
, subjects
.
The gt.object
function and useful functions associated with that object.
Many more examples in the vignette!
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Preparation: test res <- gt(Y,X) # Covariates covariates(res) covariates(res, what = "w") covariates(res, zoom = TRUE) # Subjects subjects(res) subjects(res, what = "w", mirror = FALSE) # Change legend, colors or labels covariates(res, legend = c("upregulated", "downregulated")) covariates(res, col = rainbow(2)) covariates(res, alias = letters[1:10]) # Extract data from the plot out <- covariates(res) result(out) extract(out)
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Preparation: test res <- gt(Y,X) # Covariates covariates(res) covariates(res, what = "w") covariates(res, zoom = TRUE) # Subjects subjects(res) subjects(res, what = "w", mirror = FALSE) # Change legend, colors or labels covariates(res, legend = c("upregulated", "downregulated")) covariates(res, col = rainbow(2)) covariates(res, alias = letters[1:10]) # Extract data from the plot out <- covariates(res) result(out) extract(out)
Tests a low-dimensional null hypothesis against a potentially high-dimensional alternative in regression models (linear regression, logistic regression, poisson regression, Cox proportional hazards model).
gt(response, alternative, null, data, test.value, model = c("linear", "logistic", "cox", "poisson", "multinomial"), levels, directional = FALSE, standardize = FALSE, permutations = 0, subsets, weights, alias, x = FALSE, trace)
gt(response, alternative, null, data, test.value, model = c("linear", "logistic", "cox", "poisson", "multinomial"), levels, directional = FALSE, standardize = FALSE, permutations = 0, subsets, weights, alias, x = FALSE, trace)
response |
The response vector of the regression model. May be
supplied as a vector or as a |
alternative |
The part of the design matrix corresponding to
the alternative hypothesis. The covariates of the null model do
not have to be supplied again here. May be given as a half
|
null |
The part of the design matrix corresponding to the null hypothesis. May be given as a design matrix or as a half |
).
data |
Only used when |
test.value |
An optional vector regression coefficients to test. The default is to test the null hypothesis that all regression coefficients of the covariates of the alternative are zero. The |
model |
The type of regression model to be tested. If omitted, the function will try to determine the model from the class and values of the |
levels |
Only used if response is |
directional |
If set to |
standardize |
If set to |
permutations |
The number of permutations to use. The default, |
subsets |
Optional argument that can be used to test one or more subsets of the covariates in |
weights |
Optional argument that can be used to give certain covariates in |
alias |
Optional second label for each test. Should be a vector of the same length as |
x |
If |
trace |
If |
The Global Test tests a low-dimensional null hypothesis against a (potentially) high-dimensional alternative, using the locally most powerful test of Goeman et al (2006). In this regression model implementation, it tests the null hypothesis response ~ null
, that the covariates in alternative
are not associated with the response, against the alternative model response ~ null + alternative
that they are.
The test has a wide range of applications. In gene set testing in microarray data analysis alternative
may be a matrix of gene expression measurements, and the aim is to find which of a collection of predefined subsets
of the genes (e.g. Gene Ontology terms or KEGG pathways) is most associated with the response
. In penalized regression or other machine learning techniques, alternative
may be a collection of predictor variables that may be used to predict a response
, and the test may function as a useful pre-test to see if training the classifier is worthwhile. In goodness-of-fit testing, null
may be a model with linear terms fitted to the response
, and alternative
may be a large collection of non-linear terms. The test may be used in this case to test the fit of the null model with linear terms against a non-linear alternative.
See the vignette for extensive examples of these applications.
The function returns an object of class gt.object
. Several operations and diagnostic plots can be made from this object. See also Diagnostic plots.
If null
is supplied as a formula
object, an intercept is automatically included. As a consequence gt(Y, X, Z)
will usually give a different result from gt(Y, X, ~Z)
. The first call is equivalent to gt(Y, X, ~0+Z)
, whereas the second call is equivalent to gt(Y, X, cbind(1,Z))
.
P-values from the asymptotic distribution are accurate to at least two decimal places up to a value of around 1e-12
. Lower p-values are numerically less reliable.
Missing values are allowed in the alternative
matrix only. Missing values are imputed conservatively (i.e. under the null hypothesis). Covariates with many missing values get reduced variance and therefore automatically carry less weight in the test result.
Jelle Goeman: [email protected]; Jan Oosting
General theory and properties of the global test are described in
Goeman, Van de Geer and Van Houwelingen (2006) Journal of the Royal Statistical Society, Series B 68 (3) 477-493.
For references related to applications of the test, see the vignette GlobalTest.pdf included with this package.
Diagnostic plots: covariates
, subjects
.
The gt.object
function and useful functions associated with that object.
Many more examples in the vignette!
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Compare the global test with the F-test gt(Y, X) anova(lm(Y~X)) # Using formula input res <- gt(Y, ~A+B, null=~C+E, data=data.frame(X)) summary(res) # Beware: null models with and without intercept Z <- rnorm(20) summary(gt(Y, X, null=~Z)) summary(gt(Y, X, null=Z)) # Logistic regression gt(Y>0, X) # Subsets and weights (1) my.sets <- list(c("A", "B"), c("C","D"), c("D", "E")) gt(Y, X, subsets = my.sets) my.weights <- list(1:2, 2:1, 3:2) gt(Y, X, subsets = my.sets, weights=my.weights) # Subsets and weights (2) gt(Y, X, subset = c("A", "B")) gt(Y, X, subset = c("A", "A", "B")) gt(Y, X, subset = c("A", "A", "B"), weight = c(.5,.5,1)) # Permutation testing summary(gt(Y, X, perm=1e4))
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Compare the global test with the F-test gt(Y, X) anova(lm(Y~X)) # Using formula input res <- gt(Y, ~A+B, null=~C+E, data=data.frame(X)) summary(res) # Beware: null models with and without intercept Z <- rnorm(20) summary(gt(Y, X, null=~Z)) summary(gt(Y, X, null=Z)) # Logistic regression gt(Y>0, X) # Subsets and weights (1) my.sets <- list(c("A", "B"), c("C","D"), c("D", "E")) gt(Y, X, subsets = my.sets) my.weights <- list(1:2, 2:1, 3:2) gt(Y, X, subsets = my.sets, weights=my.weights) # Subsets and weights (2) gt(Y, X, subset = c("A", "B")) gt(Y, X, subset = c("A", "A", "B")) gt(Y, X, subset = c("A", "A", "B"), weight = c(.5,.5,1)) # Permutation testing summary(gt(Y, X, perm=1e4))
A collection of procedures for performing the Global Test on gene set databases. Three function are provided for KEGG, for Gene Ontology and for the Broad Institute's gene sets.
gtKEGG (response, exprs, ..., id, annotation, probe2entrez, multtest = c("Holm", "BH", "BY"), sort = TRUE) gtGO (response, exprs, ..., id, annotation, probe2entrez, ontology = c("BP", "CC", "MF"), minsize=1, maxsize=Inf, multtest = c("Holm", "focuslevel", "BH", "BY"), focuslevel = 10, sort = TRUE) gtConcept (response, exprs, ..., annotation, probe2entrez, conceptmatrix, concept2name = "conceptID2name.txt", entrez2concept = "entrezGeneToConceptID.txt", threshold = 1e-4, share = TRUE, multtest = c("Holm", "BH", "BY"), sort = TRUE) gtBroad (response, exprs, ..., id, annotation, probe2entrez, collection, category = c("c1", "c2", "c3", "c4", "c5"), multtest = c("Holm", "BH", "BY"), sort = TRUE)
gtKEGG (response, exprs, ..., id, annotation, probe2entrez, multtest = c("Holm", "BH", "BY"), sort = TRUE) gtGO (response, exprs, ..., id, annotation, probe2entrez, ontology = c("BP", "CC", "MF"), minsize=1, maxsize=Inf, multtest = c("Holm", "focuslevel", "BH", "BY"), focuslevel = 10, sort = TRUE) gtConcept (response, exprs, ..., annotation, probe2entrez, conceptmatrix, concept2name = "conceptID2name.txt", entrez2concept = "entrezGeneToConceptID.txt", threshold = 1e-4, share = TRUE, multtest = c("Holm", "BH", "BY"), sort = TRUE) gtBroad (response, exprs, ..., id, annotation, probe2entrez, collection, category = c("c1", "c2", "c3", "c4", "c5"), multtest = c("Holm", "BH", "BY"), sort = TRUE)
response |
The response variable of the regression model. This is passed on to the |
exprs |
The expression measurements. May be
|
... |
Any other arguments are also passed on to |
id |
The identifier(s) of gene sets to be tested (character vector). If omitted, tests all gene sets in the database. |
annotation |
The name of the probe annotation package for the microarray that was used, or the name of the genome wide annotation package for the species (e.g. org.Hs.eg.db for human). If an organism package is given, the argument |
probe2entrez |
Use only if no probe annotation package is available. A mapping from probe identifiers to entrez gene ids. May be an environment, named list or named vector. |
multtest |
The method of multiple testing correction. Choose from: Benjamini and Hochberg FDR control (BH); Benjamini and Yekutieli FDR control (BY) or Holm familywise error fontrol (Holm). For |
sort |
If |
ontology |
The ontology or ontologies to be used. Default is to use all three ontologies. |
minsize |
The minimum number of probes that may be annotated to a gene set. Gene sets with fewer annotated probes are discarded. |
maxsize |
The maximum number of probes that may be annotated to a gene set. Gene sets with more annotated probes are discarded. |
focuslevel |
The focus level to be used for the focus level method. Either a vector of gene set ids, or a numerical level. In the latter case, |
collection |
The Broad gene set collection, created by a call to |
conceptmatrix |
The name of the file containing the importance weights, i.e. concept profile associations between Anni concepts. In the matrix contained in the file, columns correspond to testable concepts, and rows correspond to entrez-concepts. Useable files can be downloaded from http://biosemantics.org/index.php/software/weighted-global-test. |
concept2name |
The name of the file containing a mapping between Anni concepts and entrez identifiers. Useable files can be downloaded from http://biosemantics.org/index.php/software/weighted-global-test. |
entrez2concept |
The name of the file containing a mapping between Anni concept numbers and names. Useable files can be downloaded from http://biosemantics.org/index.php/software/weighted-global-test. |
threshold |
The relevance threshold for importance weights. Importance weights below the threshold are treated as zero. |
share |
If |
category |
The subcategory of the Broad collection to be tested. The default is to test all sets. |
These are utility functions to make it easier to do gene set testing of gene sets available in gene set databases. The functions automatically retrieve the gene sets, preprocess and select them, perform global test, do multiple testing correction, and sort the results on the basis of their p-values.
The four functions use different databases for testing. gtKEGG
and gtGO
use KEGG (http://www.genome.jp/kegg) and GO (http://www.geneontology.org); gtConcept
uses the Anni database (http://www.biosemantics.org/anni), and gtBroad
uses the MSigDB database (http://www.broadinstitute.org/gsea/msigdb). The gtConcept
function differs from the other three in that it uses association weights between 0 and 1 for genes within sets, rather than having a hard cut-off for membership of a gene in a set.
All functions require that annotate
and the appropriate annotation packages are installed. gtKEGG
additionally requires the KEGG.db
package; gtGO
requires the GO.db
package; gtBroad
requires the user to download the XML file "msigdb_v2.5.xml"
from \ http://www.broad.mit.edu/gsea/downloads.jsp
, and to preprocess that file using the getBroadSets
function. gtConcept
requires files that can be downloaded from http://biosemantics.org/index.php/software/weighted-global-test.
A gt.object
.
Jelle Goeman: [email protected]; Jan Oosting
Goeman, Van de Geer, De Kort and Van Houwelingen (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20 (1) 93-99.
Goeman, Oosting, Cleton-Jansen, Anninga and Van Houwelingen (2005). Testing association of a pathway with survival using gene expression data. Bioinformatics 21 (9) 1950-1957.
The gt
function. The gt.object
and useful functions associated with that object.
# Examples in the Vignette
# Examples in the Vignette
Tests the goodness of fit of a regression model against a specified alternative using the Global Test.
Three main functions are provided: gtPS
uses Penalized Splines, gtKS
uses Kernel Smoothers and gtLI
uses Linear Interactions. The other functions are for external use in combination with gt
.
gtPS(response, null, data, model = c("linear", "logistic", "cox", "poisson", "multinomial"), ..., covs, bdeg = 3, nint= 10, pord = 2, interact = FALSE, robust = FALSE, termlabels = FALSE, returnZ = FALSE) gtKS(response, null, data, model = c("linear", "logistic", "cox", "poisson", "multinomial"), ..., covs, quant = .25, metric = c("euclidean", "pearson"), kernel=c("uniform", "exponential", "triangular", "neighbours", "gauss"), robust = FALSE, scale = TRUE, termlabels = FALSE, returnZ = FALSE) gtLI(response, null, data, ..., covs, iorder=2, termlabels = FALSE, standardize = FALSE) bbase(x, bdeg, nint) btensor(xs, bdeg, nint, pord, returnU=FALSE) reparamZ(Z, pord, K=NULL, tol = 1e-10, returnU=FALSE) reweighZ(Z, null.fit) sterms(object, ...)
gtPS(response, null, data, model = c("linear", "logistic", "cox", "poisson", "multinomial"), ..., covs, bdeg = 3, nint= 10, pord = 2, interact = FALSE, robust = FALSE, termlabels = FALSE, returnZ = FALSE) gtKS(response, null, data, model = c("linear", "logistic", "cox", "poisson", "multinomial"), ..., covs, quant = .25, metric = c("euclidean", "pearson"), kernel=c("uniform", "exponential", "triangular", "neighbours", "gauss"), robust = FALSE, scale = TRUE, termlabels = FALSE, returnZ = FALSE) gtLI(response, null, data, ..., covs, iorder=2, termlabels = FALSE, standardize = FALSE) bbase(x, bdeg, nint) btensor(xs, bdeg, nint, pord, returnU=FALSE) reparamZ(Z, pord, K=NULL, tol = 1e-10, returnU=FALSE) reweighZ(Z, null.fit) sterms(object, ...)
response |
The response vector of the regression model. May be
supplied as a vector, as a |
null |
The null design matrix. May be given as a matrix or as a half |
data |
Only used when |
model |
The type of regression model to be tested. If omitted, the function will try to determine the model from the class and values of the |
... |
Any other arguments are also passed on to |
covs |
A variable or a vector of variables that are the covariates the smooth terms are function of. |
bdeg |
A vector or a list of vectors which specifies the degree of the B-spline basis, with default |
nint |
A vector or a list of vectors which specifies the number of intervals determined by equally-spaced knots, with default |
pord |
A vector or a list of vectors which specifies the order of the differences indicating the type of the penalty imposed to the coefficients, with default |
interact |
|
termlabels |
|
robust |
|
returnZ |
|
quant |
The smoothing bandwidth to be used, expressed as the percentile of the distribution of distance between observations, with default the 25th percentile. To investigate the sensitivity to different choices, |
metric |
A character string specifying the metric to be used. The available options are "euclidean" (the default), "pearson" and "mixed" (to be implemented). "mixed" distance is chosen automatically if some of the selected covariates are not numeric. |
kernel |
A character string giving the smoothing kernel to be used. This must be one of "uniform", "exponential", "triangular", "neighbours", or "Gauss", with default "uniform". |
scale |
|
iorder |
Order of the linear interactions, e.g. second order interactions, third order etc., with default |
standardize |
TRUE standardizes all covariates of the alternative to have unit second central moment. This makes sure that the test result is independent of the relative scaling of the covariates. |
x |
A numeric vector of values at which to evaluate the B-spline basis. |
xs |
A matrix or dataframe where the columns correspond to covariates values. |
returnU |
codeTRUE gives back the nonpenalized part. |
Z |
Alternative design matrix. |
K |
Penalty matrix (i.e. the penalty term is the quadratic form of K and the spline coefficients). |
tol |
Eigenvalues smaller than |
null.fit |
Fitted null model. |
object |
These are functions to test for specific types of lack of fit by using the Global Test.
Suppose that we are concerned with the adequacy of some regression model response ~ null
, such as Y ~ X1 + X2
.
The alternative model can be cast into the generic form response ~ null + alternative
, which comprises different models that accomodate to different types of lack of fit. Thus, the specification of alternative
is required. It identifies the type of lack of fit the test is directed against.
By using gtPS
, the alternative is given by a user specified sum of smooth functions of continuous covariates,
e.g. alternative= ~ s(X1)
when covs="X1"
and alternative= ~ s(X1) + s(X2)
when covs=c("X1","X2")
.
Smooth terms are constructed using P-splines as proposed by Eilers and Marx (1996). This approach consists in constructing
a B-spline basis of degree bdeg
with nint + 1
equidistant knots, where a difference penalty of order pord
is applied to the basis coefficients. If interact=TRUE
, the alternative is given by a multidimensional smooth function of covs
, which is represented by a tensor product of marginal B-splines bases and Kronecker sum of the marginal penalties, e.g. alternative= ~ s(X1,X2)
when covs=c("X1","X2")
and interact=TRUE
.
By using gtKS
the alternative is given by a user specified multidimensional smooth term, e.g. alternative= ~ s(X1, X2)
when covs=c("X1","X2")
.
Multidimensional smooth terms are represented by a kernel smoother defined by a distance measure (metric
), a kernel shape (kernel
) and a bandwidth (quant
).
Because the test is sensitive to the chosen value of quant
, it is possible to specify quant
as a vector of different values in combination with robust=TRUE
.
Distance measures for factor covariates and for the situation that both continuous and factor covariates are present are constructed as in le Cessie and van Houwelingen (1995), e.g. covs=c("X1","X2")
and distance="mixed"
when X1
continuous and X2
factor (to be implemented).
By using gtLI
, the alternative is given by all the possible ith-order linear interactions between covs
, e.g. alternative= ~ X1:X2 + X1:X3 + X2:X3
when covs=c("X1","X2","X3")
and iorder=2
.
The remaining functions are meant for constructing the alternative design matrix that will be used in the alternative
argument of the gt
function.
bbase
constructs the B-spline basis for the covariate x
. This function is based on the functions provided by Eilers and Marx (1996).
btensor
builts a tensor product of B-splines for the covariates xs
, which is reparameterized according with a Kroneker sum of penalties.
reparamZ
reparameterizes the alternative design matrix (e.g. a spline basis B
) according with the order of differences pord
or via the spectral decomposition of a roughness matrix K
. When several smooth terms are to be combined, reweighZ
assigns equal weight to each component term.
See the vignette for more examples.
The function returns an object of class gt.object
. Several operations and diagnostic plots can be made from this object.
Currently linear (normal), logistic, multinomial logistic and Poisson regression models with canonical links and Cox's proportional hazards regression model are supported.
Aldo Solari: [email protected]
Eilers, Marx (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11: 89-121.
le Cessie, van Houwelingen (1995). Testing the Fit of a Regression Model Via Score Tests in Random Effects Models. Biometrics 51: 600-614.
For references related to applications of the test, see the vignette GlobalTest.pdf included with this package.
The gt
function. The gt.object
and useful functions associated with that object.
# Random data set.seed(0) X1<-runif(50) s1 <- function(x) exp(2 * x) e <- rnorm(50) Y <- s1(X1) + e ### gtPS res<-gtPS(Y~X1) res@result sterms(res) # model input rdata<-data.frame(Y,X1) nullmodel<-lm(Y~X1,data=rdata) gtPS(nullmodel) # formula input and termlabels gtPS(Y~exp(2*X1),data=rdata) gtPS(Y~exp(2*X1),covs="exp(2 * X1)",data=rdata) sterms(gtPS(Y~exp(2*X1),data=rdata,termlabels=TRUE)) # P-splines arguments gtPS(Y~X1, bdeg=3, nint=list(a=10, b=30), pord=0) gtPS(Y~X1, bdeg=3, nint=list(a=10, b=30), pord=0, robust=TRUE) # Random data: additive model X2<-runif(50) s2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10 Y <- s1(X1) + s2(X2) + e gtPS(Y~X1+X2) gtPS(Y~X1+X2, covs="X2") sterms(gtPS(Y~X1+X2, nint=list(a=c(10,30), b=20))) # Random data: smooth surface s12 <- function(a, b, sa = 1, sb = 1) { (pi^sa * sb) * (1.2 * exp(-(a - 0.2)^2/sa^2 - (b - 0.3)^2/sb^2) + 0.8 * exp(-(a - 0.7)^2/sa^2 - (b - 0.8)^2/sb^2)) } Y <- s12(X1,X2) + e # Tensor product of P-splines res<-gtPS(Y~X1*X2, interact=TRUE) res@result sterms(res) ### gtKS res<-gtKS(Y~X1*X2) res@result sterms(res) gtKS(Y~X1*X2, quant=seq(.05,.95,.05), robust=TRUE) ### gtLI library(MASS) data(Boston) gtLI(medv~., data=Boston, standardize=TRUE)
# Random data set.seed(0) X1<-runif(50) s1 <- function(x) exp(2 * x) e <- rnorm(50) Y <- s1(X1) + e ### gtPS res<-gtPS(Y~X1) res@result sterms(res) # model input rdata<-data.frame(Y,X1) nullmodel<-lm(Y~X1,data=rdata) gtPS(nullmodel) # formula input and termlabels gtPS(Y~exp(2*X1),data=rdata) gtPS(Y~exp(2*X1),covs="exp(2 * X1)",data=rdata) sterms(gtPS(Y~exp(2*X1),data=rdata,termlabels=TRUE)) # P-splines arguments gtPS(Y~X1, bdeg=3, nint=list(a=10, b=30), pord=0) gtPS(Y~X1, bdeg=3, nint=list(a=10, b=30), pord=0, robust=TRUE) # Random data: additive model X2<-runif(50) s2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10 Y <- s1(X1) + s2(X2) + e gtPS(Y~X1+X2) gtPS(Y~X1+X2, covs="X2") sterms(gtPS(Y~X1+X2, nint=list(a=c(10,30), b=20))) # Random data: smooth surface s12 <- function(a, b, sa = 1, sb = 1) { (pi^sa * sb) * (1.2 * exp(-(a - 0.2)^2/sa^2 - (b - 0.3)^2/sb^2) + 0.8 * exp(-(a - 0.7)^2/sa^2 - (b - 0.8)^2/sb^2)) } Y <- s12(X1,X2) + e # Tensor product of P-splines res<-gtPS(Y~X1*X2, interact=TRUE) res@result sterms(res) ### gtKS res<-gtKS(Y~X1*X2) res@result sterms(res) gtKS(Y~X1*X2, quant=seq(.05,.95,.05), robust=TRUE) ### gtLI library(MASS) data(Boston) gtLI(medv~., data=Boston, standardize=TRUE)
The class gt.object is the output of a call to
gt
. It stores the information needed for various diagnostic plots.
These slots are not meant to be directly accessed by the user.
result
:Object of class "matrix". The number of rows of this matrix is the number of tests performed. The matrix has at least the columns "p-value", "Statistic" "Expected", "Std.dev", and "#Cov".
extra
:Object of class "data.frame". Holds additional information that may be added later about the tests performed, such as multiplicity-adjusted p-values (see p.adjust
), alias names for tests and comparative proportions (see comparative
).
call
:The matched call to gt
.
functions
:A "list" of various functions used by the covariates
and subjects
functions and the various methods.
subsets
:A "list" or "NULL". Stores the subsets tested, if more than one.
structure
:A "list" or "NULL". Stores subset and superset relationships between the sets in the "subsets" slot.
weights
:A "list" or "NULL". Stores the weight vectors used for testing, if more than one.
alternative
:If gt
was called with x = TRUE
, stores the design matrix of the alternative hypothesis; "NULL" otherwise.
null
:If gt
was called with x = TRUE
, stores the design matrix of the null hypothesis; "NULL" otherwise.
directional
Stores the directional
argument of the call to gt
.
legend
Object of class "list". Stores appropriate legends for the covariates
and subjects
plots.
model
Object of class "character". Stores the model.
(gt.object): Prints the test results: p-value, test statistic, expected value of the test statistic under the null hypothesis, standard deviation of the test statistic under the null hypothesis, and number of covariates tested.
(gt.object): Prints the test results (as show
) plus additional information on the model and the test.
(gt.object): Extracts the p-values.
(gt.object): Extracts z-score: (Test statistic - Expected value) / Standard deviation.
(gt.object): Extracts the results matrix together with the additional (e.g. multiple testing) information in the extra
slot.
(gt.object): Extracts the results matrix for the leaf nodes after a call to link{covariates}
, with information on direction of association.
(gt.object): Sorts the pathways to increasing p-values. Equal p-values are sorted on decreasing z-scores.
(gt.object): Extracts results of one or more test results if multiple tests were performed. Identical to "[[".
(gt.object): Extracts results of one or more test results if multiple tests were performed. Identical to "[".
(gt.object): The number of tests performed.
(gt.object): Extracts a vector with the number of alternative covariates tested for each test.
(gt.object): Extracts the row names of the results matrix.
(gt.object): Changes the row names of the results matrix. Duplicate names are not allowed, but see alias
.
(gt.object): Extracts the "alias" column of the results matrix that can be used to add additional information on each test perfomed.
(gt.object): Changes the "alias" column of the results matrix. Note that unlike for names, duplicate aliases are allowed.
(gt.object): extracts the effective weights of the covariates as they are used internally by the test.
(gt.object): extracts the "subsets" slot.
(gt.object): Produces a histogram to visualize the permutation test statistics. Only relevant after permutation testing.
(gt.object): Produces a plot to show the influence of individual covariates on the test result. See covariates
for details.
(gt.object): Produces a plot to show the influence of individual subjects on the test result. See subjects
for details.
(gt.object): Performs multiple testing correction and produces multiplicity-corrected p-values. See p.adjust
for details.
(gt.object): Compares the p-values of tests performed on a subsets or weights with p-values of random subsets of covariates of same size or randomly distributed weights. See comparative
for details.
(gt.object): Prints the smooth terms specified by gtPS
, gtKS
or gtLI
.
Jelle Goeman: [email protected]; Jan Oosting
gt
, covariates
, subjects
.
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + 0.5*Y colnames(X) <- LETTERS[1:10] # Make a gt.object sets <- list(odd = c(1,3,5,7,9), even = c(2,4,6,8,10)) res <- gt(Y, X, subsets=sets) # Show the results res summary(res) sort(res) p.value(res) subsets(res) # Names names(res) names(res) <- c("ODD", "EVEN") alias(res) <- c("odd covariates", "even covariates") # Multiple testing p.adjust(res, method = "holm") p.adjust(res, method = "BH") # Diagnostics weights(res) covariates(res[1]) extract(covariates(res[1])) subjects(res[1]) # Permutation testing res <- gt(Y, X, perm = 1e4) hist(res)
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + 0.5*Y colnames(X) <- LETTERS[1:10] # Make a gt.object sets <- list(odd = c(1,3,5,7,9), even = c(2,4,6,8,10)) res <- gt(Y, X, subsets=sets) # Show the results res summary(res) sort(res) p.value(res) subsets(res) # Names names(res) names(res) <- c("ODD", "EVEN") alias(res) <- c("odd covariates", "even covariates") # Multiple testing p.adjust(res, method = "holm") p.adjust(res, method = "BH") # Diagnostics weights(res) covariates(res[1]) extract(covariates(res[1])) subjects(res[1]) # Permutation testing res <- gt(Y, X, perm = 1e4) hist(res)
Sets various global options for the functions in the globaltest package.
gt.options (trace, trim, transpose, max.print, warn.deprecated)
gt.options (trace, trim, transpose, max.print, warn.deprecated)
trace |
(Default: |
trim |
(Default: |
transpose |
(Default: |
max.print |
(Default: |
warn.deprecated |
(Default: |
The globaltest options can be set during a session, and apply to all calls to functions in the globaltest package for the rest of the session. They are not remembered between sessions.
Jelle Goeman: [email protected]
The gt
function.
# setting options gt.options(max.print=45, trim=TRUE) # reading options gt.options()
# setting options gt.options(max.print=45, trim=TRUE) # reading options gt.options()
Fits a multinomial logistic regression model to a nominal scale outcome.
mlogit(formula, data, control = glm.control())
mlogit(formula, data, control = glm.control())
formula |
An object of class |
data |
An optional data frame containing the variables in the model. If not found in 'data', the variables are taken from the environment from which 'mlogit' is called. |
control |
A list of parameters for controlling the fitting process.
See the documentation of |
The function mlogit fits a multinomial logistic regression
model for a multi-valued outcome with nominal scale. The
implementation and behaviour are designed to mimic those of
glm
, but the options are (as yet) more
limited. Missing values are not allowed in the data.
The model is fitted without using a reference outcome category; the parameters are made identifiable by the requirement that the sum of corresponding regression coefficients over the outcome categories is zero.
An object of (S4) class mlogit
. The class has slots:
coefficients (matrix), standard.err (matrix), fitted.values
(matrix), x (matrix), y (matrix), formula (formula), call (call),
df.null (numeric), df.residual (numeric), null.deviance (numeric),
deviance (numeric), iter (numeric), converged (logical).
Methods implemented for the mlogit
class are
coefficients
, fitted.values
, residuals
and
which extract the relevant quantities, and summary
, which
gives the same output as with a glm
object.
Jelle Goeman: [email protected]; Jan Oosting
y <- factor(rep(1:4, 5)) x <- 1:20 fit <- mlogit(y ~ x) summary(fit) residuals(fit)
y <- factor(rep(1:4, 5)) x <- 1:20 fit <- mlogit(y ~ x) summary(fit) residuals(fit)
A collection of multiple testing procedures for the Global Test. Methods for the focus level procedure of Goeman and Mansmann for graph-structured hypotheses, and for the inheritance procedure based on Meinshausen.
# The focus level method: focusLevel (test, sets, focus, ancestors, offspring, stop = 1, atoms = TRUE, trace) findFocus (sets, ancestors, offspring, maxsize = 10, atoms = TRUE) # The inheritance method: inheritance (test, sets, weights, ancestors, offspring, Shaffer, homogeneous = TRUE, trace) # Utilities for focus level and inheritance method: leafNodes (object, alpha=0.05, type = c("focuslevel","inheritance")) draw (object, alpha = 0.05, type = c("focuslevel","inheritance"), names=FALSE, sign.only = FALSE, interactive = FALSE)
# The focus level method: focusLevel (test, sets, focus, ancestors, offspring, stop = 1, atoms = TRUE, trace) findFocus (sets, ancestors, offspring, maxsize = 10, atoms = TRUE) # The inheritance method: inheritance (test, sets, weights, ancestors, offspring, Shaffer, homogeneous = TRUE, trace) # Utilities for focus level and inheritance method: leafNodes (object, alpha=0.05, type = c("focuslevel","inheritance")) draw (object, alpha = 0.05, type = c("focuslevel","inheritance"), names=FALSE, sign.only = FALSE, interactive = FALSE)
object |
A |
test |
Either a |
sets |
A named |
focus |
The focus level of the focus level method. Must be a subset of |
ancestors |
An environment or list that maps each set in |
offspring |
An environment or list that maps each set in |
stop |
Determines when to stop the algorithm. If |
atoms |
If set to |
trace |
If set to |
maxsize |
Parameter to choose the height of the focus level. The focus level sets are chosen in such a way that the number of tests that is to be done for each
focus level set is at most |
alpha |
The alpha level of familywise error control for the significant subgraph. |
Shaffer |
If set to |
weights |
Optional weights vector for the leaf nodes. If it is missing but |
homogeneous |
If set to |
type |
Argument for specifying which multiple testing correction method should be used. Only relevant if both the inheritance and the focuslevel procedures were performed on the same set of test results. |
names |
If set to |
sign.only |
If set to |
interactive |
If set to |
Multiple testing correction becomes important if the Global Test is performed on many covariate subsets.
If the hypotheses are structured in such a way that many of the tested subsets are subsets of other sets, more powerful procedures can be applied that take advantage of this structure to gain power. Two methods are implemented in the globaltest
package: the inheritance
method for tree-structured hypotheses and the focusLevel
method for general directed acyclic graphs. For simple multiple testing that does not use such structure, see p.adjust
.
The focusLevel
procedure makes use of the fact that some sets are subsets or supersets of each other, as specified by the user in the offspring
and
ancestors
arguments. Viewing the subset and superset structure as a graph, the procedure starts testing at a focus
level: a subset of the nodes of the graph.
If the procedure finds significance at this focus level, it proceeds to find significant subsets and supersets of the focus level sets. Like Holm's procedure, the focus
level procedure is valid regardless of the correlation structure between the test statistics.
The focus level method requires the choice of a “focus level” in the graph. The findFocus
function is a utility function for automatically choosing a focus level. It chooses a collection of focus level sets in such a way that the number of tests to be done for each focus level node is at most 2^maxsize
. In practice this usually means that each focus level node has at most maxsize
leaf nodes as offspring. Choosing focus level nodes with too many offspring nodes may result in excessively long computation times.
The inheritance
method is an alternative method for calculating familywise error rate corrected p-values. Like the focus level method, inheritance
also makes use of the structure of the tested sets to gain power. In this case, however, the graph is restricted to a tree, as can be obtained for example if the tested subsets are obtained from a hierarchical clustering. The inheritance procedure is used in the covariates
function. Like Holm's method and the focus level method, the inheritance procedure makes no assumptions on the joint distribution of the test statistics.
The leafNodes
function extracts the leaf nodes of the significant subgraph after a focus level procedure was performed. As this graph is defined by its leaf nodes, this is the most efficient summary of the test result. Only implemented for gt.object
input.
The draw
function draws the graph, displaying the significant nodes. It either draws the full graph with the non-significant nodes grayed out (sign.only = TRUE
), or it draws only the subgraph corresponding to the significant nodes.
See the vignette for extensive applications.
The function multtest
returns an object of class
gt.object
with an
appropriate column added to the test results matrix.
The focusLevel
and inheritance
functions returns a
gt.object
if a
gt.object
argument was given
as input, otherwise it returns a matrix with a column of raw p-values
and a column of corrected p-values.
The function leafNodes
returns a
gt.object
.
findFocus
returns a character vector.
In the graph terminology of the focus level method, ancestor
means superset, and offspring
means subset.
The validity of the focus level procedure depends on certain assumptions on the null hypothesis that is tested for each set. See the paper by Goeman and Mansmann (cited below) for the precise assumptions. Similar assumptions are necessary for the Shaffer improvement of the inheritance procedure.
Jelle Goeman: [email protected]; Livio Finos
The methods used by multtest:
Holm (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65-70.
Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289-300.
Benjamini and Yekutieli (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29 (4) 1165-1188.
The focus level method:
Goeman and Mansmann (2008) Multiple testing on the directed acyclic graph of gene ontology. Bioinformatics 24 (4) 537-544.
The inheritance method:
Meinshausen (2008) Hierarchical testing of variable importance. Biometrika 95 (2), 265-278.
For references related to applications of the test, see the vignette GlobalTest.pdf included with this package.
The gt
function. The gt.object
function and useful functions associated with that object.
Many more examples in the vignette!
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Some subsets of interest my.sets1 <- list(abc = LETTERS[1:3], cde = LETTERS[3:5], fgh = LETTERS[6:8], hij = LETTERS[8:10]) res <- gt(Y, X, subsets = my.sets1) # Simple multiple testing p.adjust(res) p.adjust(res, "BH") # A whole structure of sets my.sets2 <- as.list(LETTERS[1:10]) names(my.sets2) <- letters[1:10] my.sets3 <- list(all = LETTERS[1:10]) my.sets <- c(my.sets2,my.sets1,my.sets3) # Do the focus level procedure # Choose a focus level by hand my.focus <- c("abc","cde","fgh","hij") # Or automated my.focus <- findFocus(my.sets, maxsize = 8) resF <- focusLevel(res, sets = my.sets, focus = my.focus) leafNodes(resF, alpha = .1) # Compare p.adjust(resF, "holm") # Focus level with a custom test Ftest <- function(set) anova(lm(Y~X[,set]))[["Pr(>F)"]][1] focusLevel(Ftest, sets=my.sets, focus=my.focus) # analyze data using inheritance procedure res <- gt(Y, X, subsets = list(colnames(X))) # define clusters on the covariates X hcl=hclust(dist(t(X))) # Do inheritance procedure resI=inheritance(res, sets = hcl) resI leafNodes(resI, alpha = .1) # inheritance procedure with a custom test inheritance(Ftest, sets = hcl, Shaffer=TRUE)
# Simple examples with random data here # Real data examples in the Vignette # Random data: covariates A,B,C are correlated with Y set.seed(1) Y <- rnorm(20) X <- matrix(rnorm(200), 20, 10) X[,1:3] <- X[,1:3] + Y colnames(X) <- LETTERS[1:10] # Some subsets of interest my.sets1 <- list(abc = LETTERS[1:3], cde = LETTERS[3:5], fgh = LETTERS[6:8], hij = LETTERS[8:10]) res <- gt(Y, X, subsets = my.sets1) # Simple multiple testing p.adjust(res) p.adjust(res, "BH") # A whole structure of sets my.sets2 <- as.list(LETTERS[1:10]) names(my.sets2) <- letters[1:10] my.sets3 <- list(all = LETTERS[1:10]) my.sets <- c(my.sets2,my.sets1,my.sets3) # Do the focus level procedure # Choose a focus level by hand my.focus <- c("abc","cde","fgh","hij") # Or automated my.focus <- findFocus(my.sets, maxsize = 8) resF <- focusLevel(res, sets = my.sets, focus = my.focus) leafNodes(resF, alpha = .1) # Compare p.adjust(resF, "holm") # Focus level with a custom test Ftest <- function(set) anova(lm(Y~X[,set]))[["Pr(>F)"]][1] focusLevel(Ftest, sets=my.sets, focus=my.focus) # analyze data using inheritance procedure res <- gt(Y, X, subsets = list(colnames(X))) # define clusters on the covariates X hcl=hclust(dist(t(X))) # Do inheritance procedure resI=inheritance(res, sets = hcl) resI leafNodes(resI, alpha = .1) # inheritance procedure with a custom test inheritance(Ftest, sets = hcl, Shaffer=TRUE)