Title: | Asessing statistical significance in predictive GWA studies |
---|---|
Description: | Testing individual SNPs, as well as arbitrarily large groups of SNPs in GWA studies, using a joint model of all SNPs. The method controls the FWER, and provides an automatic, data-driven refinement of the SNP clusters to smaller groups or single markers. |
Authors: | Laura Buzdugan |
Maintainer: | Laura Buzdugan <[email protected]> |
License: | GPL-3 |
Version: | 1.37.0 |
Built: | 2024-10-30 08:13:48 UTC |
Source: | https://github.com/bioc/hierGWAS |
Clusters SNPs hierachically.
cluster.snp(x = NULL, d = NULL, method = "average", SNP_index = NULL)
cluster.snp(x = NULL, d = NULL, method = "average", SNP_index = NULL)
x |
The SNP data matrix of size |
d |
|
method |
The agglomeration method to be used. This should be
(an unambiguous abbreviation of) one of |
SNP_index |
|
The SNPs are clustered using hclust
,
which performs a hierarchical cluster analysis using a set of dissimilarities
for the nvar
objects being clustered. There are 3 possible scenarios.
If d = NULL
, x
is used to compute the dissimilarity matrix.
The dissimilarity measure between two SNPs is 1 - LD
(Linkage
Disequilibrium), where LD
is defined as the square of the Pearson
correlation coefficient. If SNP_index = NULL
, all nvar
SNPs will
be clustered; otherwise only the SNPs with indices specified by SNP_index
will be considered.
If the user wishes to use a different dissimilarity measure, d
needs
to be provided. d
must be either a square matrix of size
nvar x nvar
, or an object of class dist
. If d
is
provided, x
and SNP_index
will be ignored.
An object of class dendrogram
which describes the tree
produced by the clustering algorithm hclust
.
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) clust.1 <- cluster.snp(x = x, method = "average") SNP_index <- seq(1,10) clust.2 <- cluster.snp(x = x, method = "average", SNP_index = SNP_index) d <- dist(x) clust.3 <- cluster.snp(d = d, method = "single")
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) clust.1 <- cluster.snp(x = x, method = "average") SNP_index <- seq(1,10) clust.2 <- cluster.snp(x = x, method = "average", SNP_index = SNP_index) d <- dist(x) clust.3 <- cluster.snp(d = d, method = "single")
Calculates the R2 of a cluster of SNPs.
compute.r2(x, y, res.multisplit, covar = NULL, SNP_index = NULL)
compute.r2(x, y, res.multisplit, covar = NULL, SNP_index = NULL)
x |
The input matrix, of dimension |
y |
The response vector. It can be continuous or discrete. |
res.multisplit |
The output of |
covar |
|
SNP_index |
|
The R2 of a cluster of SNPs is computed on the second half-samples.
The cluster members, are intersected with the SNPs selected by the lasso, and
the R2 of this model is calculated. Thus, we obtain B R2 values. Finally, the
mean of these values is taken. If the value of SNP_index
is NULL
, the R2 of the full
model with all the SNPs will be computed.
The R2 value of the SNP cluster
Buzdugan, L. et al. (2015), Assessing statistical significance in predictive genome-wide association studies. (unpublished)
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) beta <- rep(0,60) beta[c(5,9,3)] <- 1 y <- x %*% beta + rnorm(60) SNP_index <- c(5,9,3) res.multisplit <- multisplit(x, y) r2 <- compute.r2(x, y, res.multisplit, SNP_index = SNP_index)
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) beta <- rep(0,60) beta[c(5,9,3)] <- 1 y <- x %*% beta + rnorm(60) SNP_index <- c(5,9,3) res.multisplit <- multisplit(x, y) r2 <- compute.r2(x, y, res.multisplit, SNP_index = SNP_index)
Testing individual SNPs, as well as arbitrarily large groups of SNPs in GWA studies, using a joint model of all SNPs. The method controls the FWER, and provides an automatic, data-driven refinement of the SNP clusters to smaller groups or single markers.
hierGWAS is a package designed to assess statistical significance in GWA studies, using a hierarchical approach.
There are 4 functions provided: cluster.snp
,
multisplit
, test.hierarchy
and
compute.r2
. cluster.snp
performs
the hierarchical clustering of the SNPs, while multisplit
runs multiple penalized regressions on repeated random subsamples. These 2 functions
need to be executed before test.hierarchy
, which does the
hierarchical testing, though the order in which the 2 functions are executed does
not matter. test.hierarchy
provides the final output of
the method: a list of SNP groups or individual SNPs, along with their corresponding
p-values. Finally, compute.r2
computes the explained variance
of an arbitrary group of SNPs, of any size. This group can encompass all SNPs, SNPs
belonging to a certain chromosome, or an individual SNP.
Laura Buzdugan [email protected]
Buzdugan, L. et al. (2015), Assessing statistical significance in predictive genome-wide association studies (unpublished)
Performs repeated variable selection via the lasso on random sample splits.
multisplit(x, y, covar = NULL, B = 50)
multisplit(x, y, covar = NULL, B = 50)
x |
The SNP data matrix, of size |
y |
The response vector. It can be continuous or discrete. |
covar |
NULL or the matrix of covariates one wishes to control for, of
size |
B |
The number of random splits. Default value is 50. |
The samples are divided into two random splits of approximately
equal size. The first subsample is used for variable selection, which is
implemented using glmnet. The first [nobs/6]
variables
which enter the lasso path are selected. The procedure is repeated B
times.
If one or more covariates are specified, these will be added unpenalized to the regression.
A data frame with 2 components. A matrix of size B x [nobs/2]
containing the second subsample of each split, and a matrix of size
B x [nobs/6]
containing the selected variables in each split.
Meinshausen, N., Meier, L. and Buhlmann, P. (2009), P-values for high-dimensional regression, Journal of the American Statistical Association 104, 1671-1681.
library(MASS) x <- mvrnorm(60,mu = rep(0,200), Sigma = diag(200)) beta <- rep(1,200) beta[c(5,9,3)] <- 3 y <- x %*% beta + rnorm(60) res.multisplit <- multisplit(x, y)
library(MASS) x <- mvrnorm(60,mu = rep(0,200), Sigma = diag(200)) beta <- rep(1,200) beta[c(5,9,3)] <- 3 y <- x %*% beta + rnorm(60) res.multisplit <- multisplit(x, y)
This data set was simulated using PLINK
. Please refer to the vignette
for more details.
simGWAS
simGWAS
The dataset contains the following components:
SNP.1
The first SNP, of dimension 500 x 1
. Each
row represents a subject.
...
SNP.1000
The last SNP, of dimension 500 x 1
. Each
row represents a subject.
y
The response vector. It can be continuous or discrete.
sex
The first covariate, represeting the sex of the subjects: 0 for men and 1 for women.
age
The second covariate, represeting the age of the subjects.
data.frame
data(simGWAS)
data(simGWAS)
Performs hierarchical testing of SNPs.
test.hierarchy(x, y, dendr, res.multisplit, covar = NULL, SNP_index = NULL, alpha = 0.05, global.test = TRUE, verbose = TRUE)
test.hierarchy(x, y, dendr, res.multisplit, covar = NULL, SNP_index = NULL, alpha = 0.05, global.test = TRUE, verbose = TRUE)
x |
The input matrix, of dimension |
y |
The response vector. It can be continuous or discrete. |
dendr |
The cluster tree obtained by hierchically clustering the SNPs
using |
res.multisplit |
The output of |
covar |
|
SNP_index |
|
alpha |
The significance level at which the FWER is controlled. Default value is 0.05. |
global.test |
Specifies wether the global null hypothesis should be
tested. Default value is |
verbose |
Report information on progress. Default value is |
The testing is performed on the cluster tree given by dendr
.
If the SNP data matrix was divided (e.g. by chromosome), and clustered
separately, the user must provide the argument SNP_index
, to specify
which part of the data is being tested.
Testing starts at the highest level, which includes all variables specified
by SNP_index
, and moves down the cluster tree. It stops when a cluster's
null hypothesis cannot be rejected anymore. The smallest, still significant
clusters will be returned.
By default the parameter global.test = TRUE
, which means that first
the global null hypothesis is tested. If the data is divided (e.g. by
chromosome), and clustered separately, this parameter can be set to
FALSE
once the global null has been rejected. This helps save time.
A list of significant SNP groups with the following components:
SNP_index |
The indeces of the SNPs in the group |
pval |
The p-value of the SNP group |
Buzdugan, L. et al. (2015), Assessing statistical significance in predictive genome-wide association studies
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) beta <- rep(0,60) beta[c(5,9,3)] <- 1 y <- x %*% beta + rnorm(60) dendr <- cluster.snp(x = x, method = "average") res.multisplit <- multisplit(x, y) sign.clusters <- test.hierarchy(x, y, dendr, res.multisplit)
library(MASS) x <- mvrnorm(60,mu = rep(0,60), Sigma = diag(60)) beta <- rep(0,60) beta[c(5,9,3)] <- 1 y <- x %*% beta + rnorm(60) dendr <- cluster.snp(x = x, method = "average") res.multisplit <- multisplit(x, y) sign.clusters <- test.hierarchy(x, y, dendr, res.multisplit)