Title: | An R package for analysis of case-control studies in genetic epidemiology |
---|---|
Description: | This is a package for analysis of case-control data in genetic epidemiology. It provides a set of statistical methods for evaluating gene-environment (or gene-genes) interactions under multiplicative and additive risk models, with or without assuming gene-environment (or gene-gene) independence in the underlying population. |
Authors: | Samsiddhi Bhattacharjee [aut], Nilanjan Chatterjee [aut], Summer Han [aut], Minsun Song [aut], William Wheeler [aut], Matthieu de Rochemonteix [aut], Nilotpal Sanyal [aut], Justin Lee [cre] |
Maintainer: | Justin Lee <[email protected]> |
License: | GPL-2 + file LICENSE |
Version: | 3.43.0 |
Built: | 2024-10-30 04:34:57 UTC |
Source: | https://github.com/bioc/CGEN |
Performs a likelihood ratio test for gene-environment interaction under an additive risk model for case-control data using a standard logistic regression. A set of constraints is imposed to log-odds-ratio parameters to approximate the null model of no interaction under additive risk models. The additive interaction test under gene-environment independence assumption is performed by utilizing the retrospective likelihood by Chatterjee and Carroll (2005).
additive.test(data, response.var, snp.var, exposure.var, main.vars=NULL, strata.var=NULL, op=NULL)
additive.test(data, response.var, snp.var, exposure.var, main.vars=NULL, strata.var=NULL, op=NULL)
data |
Data frame containing all the data. No default. |
response.var |
Name of the binary response variable coded as 0 (controls) and 1 (cases). No default. |
snp.var |
Name of the genotype variable coded as 0, 1, 2 (or 0, 1). No default. |
exposure.var |
Name of the exposure variable coded as 0, 1, 2 (or 0, 1). No default. |
main.vars |
Character vector of variable names or a formula for all covariates of interest which need to be included in the model as main effects. Do not include snp.var and exposure.var here, as their main effects will automatically be included in the model. The default is NULL. |
strata.var |
For a retrospective likelihood, name of the stratification variable, such as race, ethnicity, geographic origin, or principal components of population stratification markers (Bhattacharjee et al. 2010 AJHG). It is highly recommended to include it if present. This option takes into account gene-environment dependence due to population stratification by allowing the genotype frequency to vary by the discrete level of the stratification variable. The default is NULL. |
op |
A list of options with possible names |
A maximum likelihood for a full model is obtained by optimizing a logistic regression model using a standard binomial likelihood (i.e. prospective likelihood) while a maximum likelihood for a null model is obtained by fitting a reduced model with a set of constaints imposed on logistic regression parameters to approximate the null model of no interaction in an additive risk model. The additive interaction test under the gene-environment independence assumption can be conducted by utilizing the retrospective likelihood by Chatterjee and Carroll (2005). The following is the definition of the likelihood under the gene-environment independence assumption:
Definition of the likelihood under the gene-environment independence assumption:
Let D = 0, 1 be the case-control status, G = 0, 1, 2 denote the SNP genotype, S = 1, ..., k
denote the levels of the stratification variable and Z be the design matrix for all the covariates including G, the interactions, and a column for the intercept parameter.
If denotes the allele frequency for stratum s, then
If , then
and
Let
Then the likelihood for a subject is
where the sum is taken over the 6 combinations of d and g.
Options list:
Below are the names for the options list op
. All names have default values
if they are not specified.
genetic.model
1-3 where 1=dominant, 2=recessive, 3=general.
The default is 3.
optim.method
One of "BFGS", "CG", "L-BFGS-B", "Nelder-Mead", "SANN".
The default is "BFGS".
indep
TRUE for using a retrospective likelihood for gene-environment independence assumption.
FALSE for using a standard prospective likelihood. The default is FALSE.
reltol
Stopping tolerance. The default is 1e-7.
maxiter
Maximum number of iterations. The default is 500.
A list containing the following:
additive
pval.add.LRT
P-value of the additive interaction likelihood ratio test.
pval.add.UML
P-value of the additive interaction UML Wald test.
pval.add.CML
(only for indep=T) P-value of the additive interaction CML Wald test.
pval.add.EB
(only for indep=T) P-value of the additive interaction EB Wald test.
LRT.add
Likelihood ratio test statistic for the additive interaction.
RERI.UML
UML estimate of RERI.
RERI.CML
(only for indep=T) CML estimate of RERI.
RERI.EB
(only for indep=T) EB estimate of RERI.
RERI.test
(only for 2x2 method) Test based on RERI.
S.test
(only for 2x2 method and indep=F) Test based on Synergy Index.
AP.test
(only for 2x2 method and indep=F) Test based on "Attributable Proportion due to interaction".
multiplicative (only for genetic.model != 0)
pval.mult.LRT
P-value of the multiplicative interaction likelihood ratio test.
pval.mult.UML
P-value of the multiplicative interaction UML Wald test.
pval.mult.CML
(only for indep=T) P-value of the multiplicative interaction CML Wald test.
pval.mult.EB
(only for indep=T) P-value of the multiplicative interaction EB Wald test.
LRT.mult
Likelihood ratio test statistic for the multiplicative interaction.
model.info
response.var
Response variable name
snp.var
SNP variable name
exposure.var
Exposure variable name
main.vars
Names of covariates which are included in the model as main effects. This does not include snp.var and exposure.var, whose main effects are included in the model by default.
strata.var
Name of the stratification variable for a retrospective likelihood.
op
List of options used in the fit; includes genetic.model, optim.method, indep, maxiter, reltol
GxEtable
The frequency table defined by table(snp.var, exposure.var)
method
One of "2x2", "2x3", "3x2", "3x3", "3x2trend" ("3x3trend" is currently not available)
parms.lm.UML
UML parameters estimates
parms.lm.CML
(only for indep=T) CML parameters estimates
parms.lm.EB
(only for indep=T) EB parameters estimates
cov.lm.UML
UML parameters covariance matrix estimate
cov.lm.CML
(only for indep=T) CML parameters covariance matrix estimate
cov.lm.EB
(only for indep=T) EB parameters covariance matrix estimate
full.loglike.UML
(only for indep=T) Loglikelihood of UML full model fit
full.loglike.CML
(only for indep=T) Loglikelihood of CML full model fit
DF
Degrees of freedom of the additive/multiplicative interaction test.
optim.out
(only for genetic.model != 0) Optimization output of the optim() function for a null model under an additive model restriction.
OR.table.UML
(only for indep=F) Odds ratio table UML estimate
OR.table.CML
(only for indep=T) Odds ratio table CML estimate
Sanyal, N., Napolioni, V., de Rochemonteix, M., Belloy, M. E., Caporaso, N. E., Landi, M. T., Greicius, M. D., Chatterjee, N., Han, S. S. A robust test for additive gene-environment interaction under the trend effect of genotype using an empirical Bayes-type shrinkage estimator. American Journal of Epidemiology 2021 (in press).
de Rochemonteix M, Napolioni V, Sanyal N, et al. A likelihood ratio test for gene-environment interaction based on the trend effect of genotype under an additive risk model using the gene-environment independence assumption. American Journal of Epidemiology 2021;190(1):129-41.
Han, S. S, Rosenberg P. S, Garcia-Closas M, Figueroa J. D, Silverman D, Chanock S. J, Rothman N, and Chatterjee N. Likelihood ratio test for detecting gene (G) environment (E) interactions under the additive risk model exploiting G-E independence for case-control data. Am J of Epidemiol, 2012; 176:1060-7.
Chatterjee, N. and Carroll, R. Semiparametric maximum likelihood estimation exploting gene-environment independence in case-control studies. Biometrika, 2005, 92, 2, pp.399-418.
Bhattacharjee, S., Wang, Z., Ciampa, J., Kraft, P., Chanock, S., Yu, K. and Chatterjee, N. Using principal components of genetic variation for robust and powerful detection of gene-gene interactions in case-control and case-only studies. Am J of Hum Genet, 2010; 86(3), 331-342.
# # # EXAMPLE 1: Use the lung cancer data data(Xdata2, package="CGEN") # <<1A>> Standard (prospective) likelihood additive GXE analysis under the additive genetic model # (i.e. the trend effect of genotype) using the method from de Rochemonteix et al. 2021) A0F <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 0)) A0F$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # <<1B>> Retrospective likelihood additive GXE analysis under the additive genetic model # (i.e. the trend effect of genotype) using the method (LRT) from de Rochemonteix et al. 2021 # and, for the empirical-Bayes type shrinkage estimator, the method (Wald test) from # Sanyal et al. 2021) A0T <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 0, indep = TRUE)) A0T$pval.add.LRT # P-value of the additive interaction likelihood ratio test. A0T$pval.add.EB # P-value of the additive interaction EB Wald test #(that is the weighted average of the CML and UML methods.) A0T$pval.add.UML # P-value of the additive interaction UML Wald test #(that does not assume G-E independence). A0T$pval.add.CML # P-value of the additive interaction CML Wald test #(that assumes G-E independence). # <<1C>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) A1F <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 1)) A1F$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # <<1D>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the methofd from Han et al. 2012) A1T <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 1, indep = TRUE)) A1T$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # # # EXAMPLE 2: Use the ovarian cancer data data(Xdata, package="CGEN") table(Xdata[, "gynSurgery.history"]) # Recode the exposure variable so that it is 0-1 temp <- Xdata[, "gynSurgery.history"] == 2 Xdata[temp, "gynSurgery.history"] <- 1 # <<2A>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) out1 <- additive.test(data=Xdata, response.var="case.control", snp.var="BRCA.status", exposure.var="gynSurgery.history", main.vars=c("n.children","oral.years"), op=list(genetic.model=1)) # <<2B>> Retrospective likelihood (indep = TRUE) additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) out2 <- additive.test(data=Xdata, response.var="case.control", snp.var="BRCA.status", exposure.var="gynSurgery.history", main.vars=~n.children+oral.years, strata.var="ethnic.group", op=list(indep=TRUE, genetic.model=1))
# # # EXAMPLE 1: Use the lung cancer data data(Xdata2, package="CGEN") # <<1A>> Standard (prospective) likelihood additive GXE analysis under the additive genetic model # (i.e. the trend effect of genotype) using the method from de Rochemonteix et al. 2021) A0F <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 0)) A0F$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # <<1B>> Retrospective likelihood additive GXE analysis under the additive genetic model # (i.e. the trend effect of genotype) using the method (LRT) from de Rochemonteix et al. 2021 # and, for the empirical-Bayes type shrinkage estimator, the method (Wald test) from # Sanyal et al. 2021) A0T <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 0, indep = TRUE)) A0T$pval.add.LRT # P-value of the additive interaction likelihood ratio test. A0T$pval.add.EB # P-value of the additive interaction EB Wald test #(that is the weighted average of the CML and UML methods.) A0T$pval.add.UML # P-value of the additive interaction UML Wald test #(that does not assume G-E independence). A0T$pval.add.CML # P-value of the additive interaction CML Wald test #(that assumes G-E independence). # <<1C>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) A1F <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 1)) A1F$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # <<1D>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the methofd from Han et al. 2012) A1T <- additive.test(data = Xdata2, response.var = "case.control", snp.var = "SNP", exposure.var = "smoking", main.vars = c("cov1", "cov2", "cov3", "cov4", "study"), strata.var = "study", op = list(genetic.model = 1, indep = TRUE)) A1T$pval.add.LRT # P-value of the additive interaction likelihood ratio test. # # # EXAMPLE 2: Use the ovarian cancer data data(Xdata, package="CGEN") table(Xdata[, "gynSurgery.history"]) # Recode the exposure variable so that it is 0-1 temp <- Xdata[, "gynSurgery.history"] == 2 Xdata[temp, "gynSurgery.history"] <- 1 # <<2A>> Standard (prospective) likelihood additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) out1 <- additive.test(data=Xdata, response.var="case.control", snp.var="BRCA.status", exposure.var="gynSurgery.history", main.vars=c("n.children","oral.years"), op=list(genetic.model=1)) # <<2B>> Retrospective likelihood (indep = TRUE) additive GXE analysis under the dominant genetic model # (using the method from Han et al. 2012) out2 <- additive.test(data=Xdata, response.var="case.control", snp.var="BRCA.status", exposure.var="gynSurgery.history", main.vars=~n.children+oral.years, strata.var="ethnic.group", op=list(indep=TRUE, genetic.model=1))
The new version extends the additive.test() function by implementing additive gene-gene or gene-environment tests under the trend effect of genotypes (i.e. additive genetic model), extending the previous approaches that were only available for the dominant or recessive genetic models (Han et al. 2012 AJE). Several types of interaction tests under the trend effect of genotype are available – methods with and without the gene-environment independence assumption (de Rochemonteix et al. 2020 AJE) and robust methods based on an empirical Bayes type shrinkage estimator (Sanyal et al. 2021 AJE).
This package is for logistic regression analyses of SNP data in case-control studies. It is
designed to give the users flexibility of using a number of different methods for analysis of
SNP-environment or SNP-SNP interactions. It is known that power of interaction analysis in case-control
studies can be greatly enhanced if it can be assumed that the factors (e.g. two SNPs) under study are
independently distributed in the underlying population. The package implements a number of different methods
that can incorporate such independence constraints into analysis of interactions in the setting of both
unmatched and matched case-control studies. These methods are more general and flexible than the popular
case-only method of analysis of interaction that also assumes gene-gene or/and gene-environment independence for
the underlying factors in the underlying population. The package also implements various methods, based on shrinkage
estimation and conditional-likelihoods, that can automatically adjust for possible violation of the independence assumption
that could arise due to direct causal relationship (e.g. between a gene and a behavior exposure) or
indirect correlation (e.g due to population stratification). A number of convenient summary and printing functions
are included.
In its previous version, the package has been updated with the new function snp.score
that allow
testing for disease-SNP association accounting for gene-environment interaction using an array of different
types of score-tests. The function can handle both genotyped and imputed SNPs.
The latest version of the package has extended the additive.test() function by implementing additive gene by environment tests
under the trend effect of genotypes (additive model). Two types of trend tests are available - likelihood ratio tests and Wald tests
based on an empirical Bayes type shrinkage estimator.
The package will continue to be updated with new methods as they are developed. The methods are currently
not suitable for analysis of SNPs on sex chromosomes.
The main functions for unmatched data are additive.test
, snp.logistic
and snp.score
.
Whereas additive.test
, snp.logistic
and snp.score
analyzes one SNP with each function call, GxE.scan
analyzes a collection of SNPs and writes the summary results to an external file.
With additive.test
and snp.logistic
, a data frame is input in which the SNP variable must be coded as 0-1-2 (or 0-1). The function snp.score
can be used with imputed genotypes, where the SNP variable
is coded as the expected dosage.
The functions
getSummary
, getWaldTest
and snp.effects
can be called for creating summary
tables, computing Wald tests and joint/stratified effects using the returned object from snp.logistic
(see Examples
in snp.logistic
).
With GxE.scan
, the data is read in from external files defined in
snp.list
and pheno.list
.
The collection of p-values computed in GxE.scan
, can be plotted using the functions QQ.plot
and
Manhattan.plot
.
The function for analysis of matched case-control data is snp.matched
. Optimal matching can be obtained
from the function getMatchedSets
.
The current version of the package is only suitable for analysis of SNPs on non-sex chromosomes.
Main functions for single SNP analysis: additive.test
snp.logistic
snp.matched
snp.score
For GWAS analysis: GxE.scan
GxE.scan.combine
GxE.scan.partition
pheno.list
snp.list
subject.list
Graphics: Manhattan.plot
QQ.plot
snp.effects.plot
Sample data: LocusMapData
Xdata
Miscellaneous: getMatchedSets
(Used with snp.matched
) getSummary
(The same as calling summary
) getWaldTest
(For computing Wald tests)locusMap.list
(Used with Manhattan.plot
) printEffects
(The same as calling print
) snp.effects
(For computing joint and stratified effects)
Samsiddhi Bhattacharjee, Summer Han, Minsun Song, Nilanjan Chatterjee, William Wheeler and Nilotpal Sanyal <[email protected]>
Maximum-likelihood estimation under independence
Chatterjee, N. and Carroll, R. Semiparametric maximum likelihood estimation exploting gene-environment independence in case-control studies. Biometrika, 2005, 92, 2, pp.399-418.
Shrinkage estimation
Mukherjee B, Chatterjee N. Exploiting gene-environment independence in analysis of case-control studies: An empirical Bayes approach to trade-off between bias and efficiency. Biometrics 2008, 64(3):685-94.
Mukherjee B et al. Tests for gene-environment interaction from case-control data: a novel study of type I error, power and designs. Genetic Epidemiology, 2008, 32:615-26.
Chen YH, Chatterjee N, Carroll R. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. Journal of the American Statistical Association, 2009, 104: 220-233.
Conditional Logistic Regression and Adjustment for Population stratification
Chatterjee N, Zeynep K and Carroll R. Exploiting gene-environmentindependence in family-based case-control studies: Increased power for detecting associations, interactions and joint-effects. Genetic Epidemiology2005; 28:138-156.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies. American Journal of Human Genetics, 2010, 86(3):331-342.
Score tests
Han, S.S., Rosenberg, P., Ghosh, A., Landi M.T., Caporaso N. and Chatterjee, N. An exposure weighted score test for genetic association integrating environmental risk-factors. Biometrics 2015 (Article first published online: 1 JUL 2015 | DOI: 10.1111/biom.12328)
Song M., Wheeler B., Chatterjee, N. Using imputed genotype data in joint score tests for genetic association and gene-environment interactions in case-control studies (In preparation).
Tests for additive interaction
Sanyal, N., Napolioni, V., de Rochemonteix, M., Belloy, M. E., Caporaso, N. E., Landi, M. T., Greicius, M. D., Chatterjee, N., Han, S. S. A robust test for additive gene-environment interaction under the trend effect of genotype using an empirical Bayes-type shrinkage estimator. American Journal of Epidemiology 2021 (in press).
de Rochemonteix M, Napolioni V, Sanyal N, et al. A likelihood ratio test for gene-environment interaction based on the trend effect of genotype under an additive risk model using the gene-environment independence assumption. American Journal of Epidemiology 2021;190(1):129-41.
Han, S. S, Rosenberg P. S, Garcia-Closas M, Figueroa J. D, Silverman D, Chanock S. J, Rothman N, and Chatterjee N. Likelihood ratio test for detecting gene (G) environment (E) interactions under the additive risk model exploiting G-E independence for case-control data. Am J of Epidemiol, 2012; 176:1060-7.
Obtain matching of subjects based on a set of covariates (e.g., principal components of population stratification markers). Two types of matcing are allowed 1) Case-Control(CC) matching and/or 2) Nearest-Neighbour(NN) matching.
getMatchedSets(x, CC, NN, ccs.var=NULL, dist.vars=NULL, strata.var=NULL, size=2, ratio=1, fixed=FALSE)
getMatchedSets(x, CC, NN, ccs.var=NULL, dist.vars=NULL, strata.var=NULL, size=2, ratio=1, fixed=FALSE)
x |
Either a data frame containing variables to be used for matching, or an object returned by |
CC |
Logical. TRUE if case-control matching should be computed, FALSE otherwise. No default. |
NN |
Logical. TRUE if nearest-neighbor matching should be computed, FALSE otherwise. No default. At least one of CC and NN should be TRUE. |
ccs.var |
Variable name, variable number, or a vector for the case-control status. If |
dist.vars |
Variables numbers or names for computing a distance matrix based on which matching will be performed. Must be
specified if |
strata.var |
Optional stratification variable (such as study center) for matching within strata. A vector of mode integer or factor
if |
size |
Exact size or maximum allowable size of a matched set. This can be an integer greater than 1,
or a vector of such integers that is constant within each level of |
ratio |
Ratio of cases to controls for CC matching. Currently ignored if fixed = FALSE. This can be a positive number,
or a numeric vector that is constant within each level of |
fixed |
Logical. TRUE if "size" should be interpreted as "exact size" and FALSE if it gives "maximal size" of matched sets. The default is FALSE. |
If a data frame and dist.vars
is provided, dist
along with the euclidean metric is used to compute
distances assuming conituous variables. For categorical, ordinal or mixed variables using a custom distance matrix such as that from daisy
is recommended. If strata.var
is provided both case-control (CC) and nearest-neighbor (NN) matching are performed within strata.
size
can be any integer greater than 1 but currently the matching obtained is usable in snp.matched
only if size
is 8 or smaller,
due to memory and speed limitations.
When fixed=FALSE, NN matching is computed using a modified version of hclust
, where clusters are not allowed to grow beyond the specified size
.
CC matching is computed similarly with the further constraint that each cluster must have at least one case and one control. Clusters are then split up into 1:k or k:1
matched sets, where k is at most size
- 1 (known as full matching). For exactly optimal full matching use package optmatch.
When fixed=TRUE, both CC and NN use heuristic fixed-size clustering algorithms. These algorithms start with matches in the periphery of the data space and
proceed inward. Hence prior removal of outliers is recommended.
For CC matching, number of cases in each matched set is obtained by rounding size
* [ratio
/(1+ratio
)] to the nearest integer.
The matching algorithms for fixed=TRUE
are faster, but in case of CC matching large number of case or controls may be discarded with this option.
A list with names "CC", "tblCC", "NN", and "tblNN". "CC" and "NN" are vectors of integer labels defining the matched sets,
"tblCC" and "tblNN" are matrices summarizing the size distribution of matched sets across strata. i
'th row corresponds to matched set
size of i
and columns represent different strata. The order of strata in columns may be different from that in strata.var, if strata.var was
not coded as successive integers starting from 1.
Luca et al. On the use of general control samples for genome-wide association studies: genetic matching
highlights causal variants. Amer Jour Hum Genet, 2008, 82(2):453-63.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N. Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies. American Journal of Human Genetics, 2010, 86(3):331-342.
# Use the ovarian cancer data data(Xdata, package="CGEN") # Add fake principal component columns. # set.seed(123) Xdata <- cbind(Xdata, PC1 = rnorm(nrow(Xdata)), PC2 = rnorm(nrow(Xdata))) # Assign matched set size and case/control ratio stratifying by ethnic group size <- ifelse(Xdata$ethnic.group == 3, 2, 4) ratio <- sapply(Xdata$ethnic.group, switch, 1/2 , 2 , 1) mx <- getMatchedSets(Xdata, CC=TRUE, NN=TRUE, ccs.var="case.control", dist.vars=c("PC1","PC2") , strata.var="ethnic.group", size = size, ratio = ratio, fixed=TRUE) mx$NN[1:10] mx$tblNN # Example of using a dissimilarity matrix using catergorical covariates with # Gower's distance library("cluster") d <- daisy(Xdata[, c("age.group","BRCA.history","gynSurgery.history")] , metric = "gower") # Specify size = 4 as maximum matched set size in all strata mx <- getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control, strata.var = Xdata$ethnic.group, size = 4, fixed = FALSE) mx$CC[1:10] mx$tblCC
# Use the ovarian cancer data data(Xdata, package="CGEN") # Add fake principal component columns. # set.seed(123) Xdata <- cbind(Xdata, PC1 = rnorm(nrow(Xdata)), PC2 = rnorm(nrow(Xdata))) # Assign matched set size and case/control ratio stratifying by ethnic group size <- ifelse(Xdata$ethnic.group == 3, 2, 4) ratio <- sapply(Xdata$ethnic.group, switch, 1/2 , 2 , 1) mx <- getMatchedSets(Xdata, CC=TRUE, NN=TRUE, ccs.var="case.control", dist.vars=c("PC1","PC2") , strata.var="ethnic.group", size = size, ratio = ratio, fixed=TRUE) mx$NN[1:10] mx$tblNN # Example of using a dissimilarity matrix using catergorical covariates with # Gower's distance library("cluster") d <- daisy(Xdata[, c("age.group","BRCA.history","gynSurgery.history")] , metric = "gower") # Specify size = 4 as maximum matched set size in all strata mx <- getMatchedSets(d, CC = TRUE, NN = TRUE, ccs.var = Xdata$case.control, strata.var = Xdata$ethnic.group, size = 4, fixed = FALSE) mx$CC[1:10] mx$tblCC
Returns a matrix of estimated parameters, standard errors, test statistics, and p-values.
getSummary(fit, sided=2, method=NULL)
getSummary(fit, sided=2, method=NULL)
fit |
The return object from |
sided |
1 or 2 for a 1 or 2 sided p-values. The default is 2. |
method |
Vector of values from "UML", "CML", "EB" or "CCL", "HCL", "CLR". The default is NULL. |
This function returns a matrix similar to summary(glm.obj)$coefficients
, except
the p-values are always computed using the normal distribution.
A matrix with column names "Estimate", "Std.Error", "Z.value", and "Pvalue".
The rownames of the returned matrix will be the names of parms
if parms
is a vector.
parms <- 1:5 cov <- matrix(data=1, nrow=5, ncol=5) getSummary(list(parms=parms, cov=cov)) # Compare to summary() # set.seed(123) n <- 100 y <- rbinom(n, 1, 0.5) x <- cbind(runif(n), rbinom(n, 1, 0.5)) fit <- glm(y ~ x, family=binomial()) sum <- summary(fit) sum$coefficients getSummary(fit)
parms <- 1:5 cov <- matrix(data=1, nrow=5, ncol=5) getSummary(list(parms=parms, cov=cov)) # Compare to summary() # set.seed(123) n <- 100 y <- rbinom(n, 1, 0.5) x <- cbind(runif(n), rbinom(n, 1, 0.5)) fit <- glm(y ~ x, family=binomial()) sum <- summary(fit) sum$coefficients getSummary(fit)
Computes a univariate or multivariate Wald test
getWaldTest(fit, parmNames, method=NULL)
getWaldTest(fit, parmNames, method=NULL)
fit |
Return object from |
parmNames |
Vector of parameters to test. This vector can be a character vector of parameter names or a numeric vector of positions. No default. |
method |
Vector of values from "UML", "CML", "EB" or "CCL", "HCL", "CLR". The default is NULL. |
If fit
is a list, then "parms" should be the vector of coefficients, and
"cov" should be the covariance matrix. If parmNames
is a character vector,
then "parms" should be a named vector and the names must match the rownames and
colnames of "cov". A chi-squared test is computed.
List containing the value of the test statistic (test
), degrees of freedom (df
),
and p-value (pvalue
).
# set.seed(123) n <- 100 y <- rbinom(n, 1, 0.5) x <- runif(n*5) dim(x) <- c(n, 5) x <- data.frame(x) colnames(x) <- c("x", "x2", "x3", "z", "z2") fit <- glm(y ~ ., data=x, family=binomial()) # Chi-squared test getWaldTest(fit, c("x", "z")) beta <- c(-2.5, 2.5) cov <- diag(1:2) getWaldTest(list(parms=beta, cov=cov), 1:2)
# set.seed(123) n <- 100 y <- rbinom(n, 1, 0.5) x <- runif(n*5) dim(x) <- c(n, 5) x <- data.frame(x) colnames(x) <- c("x", "x2", "x3", "z", "z2") fit <- glm(y ~ ., data=x, family=binomial()) # Chi-squared test getWaldTest(fit, c("x", "z")) beta <- c(-2.5, 2.5) cov <- diag(1:2) getWaldTest(list(parms=beta, cov=cov), 1:2)
Performs analyses for a scan of SNPs.
GxE.scan(snp.list, pheno.list, op=NULL)
GxE.scan(snp.list, pheno.list, op=NULL)
snp.list |
See |
pheno.list |
See |
op |
See details for this list of options. The default is NULL. |
For proper anlaysis of imputed SNPs, the only option that should be used is
model
=3 (snp.score
)
See the vignette vignette_GxE
for examples of running this function.
To use this function, the data must be stored in files as defined in
snp.list
and pheno.list
. See the examples on how to
create these lists. The format of the genotype data must be a format where the SNPs
are rows (snp.list$format
="impute", "ldat", "tped") or a format where either the GLU software
or PLINK software can read and transform the data.
Options for setting the path to GLU or PLINK are found in snp.list
.
The genotype data is read in from the file
snp.list$file
, and the variables for the main effects and interactions
are read in from the file pheno.list$file
.
The subjects to be included in the analysis are the subjects with matching ids in the phenotype and
genotype data.
Users can easily run their own customized scan by setting the option model
to 0 and setting
the option scan.func
to their own scan function. The vignette "vignette_GxE" has examples of
user-defined scans.
Options list op:
Below are the names for the options list op
. All names have default values
if they are not specified.
model
0-4, where 1 = snp.logistic
, 2 = additive.test
,
3 = snp.score
, 4 = snp.matched
and 0 is for a user defined
function. The default is 1.
out.file
File name to save the results. If NULL, then the output file will be
created in the working directory as
paste(getwd(), "/GxE.scan.output.txt", sep="")
.
scan.func.op
List of specific options for the scan function. For model
=1-4, see
snp.logistic
, additive.test
,
snp.score
or snp.matched
for these options.
The default is NULL.
geno.counts
0 or 1 to write the genotype frequency counts to the output file.
The default is 1.
geno.MAF
0 or 1 to write the SNP MAF to the output file.
The default is 1.
geno.missRate
0 or 1 to write the SNP missing rate to the output file.
The default is 1.
output.methods
Character vector of "UML", "CML", "EB" to only output results for
specific methods. This option is only for snp.logistic
and snp.score
.
The default is c("UML", "CML", "EB").
output.parms
0 or 1 to output parameter estimates.
This option is only for snp.logistic
and snp.score
.
The default is 0.
output.sandwich
0 or 1 to output sandwich covariance estimates.
This option is only for snp.score
.
The default is 0.
Advanced options:
scan.func
(For model
= 0). The name of the user-defined scan function.
This function must have 2 input arguments and return a named list or named vector.
See the vignette "vignette_GxE" for examples.
The default is NULL.
scan.setup.func
(For model
= 0). NULL or the name of the user-defined function
called after the phenotype data is read in and before the genotype data is read. This function
is not required for model
= 0.
This function must have 2 input arguments and return NULL or a named list.
See the vignette "vignette_GxE" for examples.
The default is NULL.
Output variable names:
The output file will contain columns for the SNP, MAF, missing rate, and genotype frequency counts. Other
columns are specific for the specified model
. A column denoted by *.Inter.Pvalue
is a p-value
from a Wald test involving the interaction terms of the SNP and exposure variables.
A column denoted by *.Omnibus.Pvalue
is a p-value
from a Wald test involving the main effect of the SNP and interaction terms of the SNP and exposure variables.
The name of the output file containing the results. See the option out.file
.
See details for the column names in the output file.
GxE.scan.partition
, GxE.scan.combine
# Define the list for the genotype data. snp.list <- list() snp.list$file <- system.file("sampleData", "geno_data.ldat.gz", package="CGEN") snp.list$file.type <- 7 snp.list$delimiter <- "\t" snp.list$in.miss <- " " # Only process the first 5 SNPs in the file snp.list$start.vec <- 1 snp.list$stop.vec <- 6 # Define pheno.list pheno.list <- list() pheno.list$file <- system.file("sampleData", "Xdata.txt", package="CGEN") pheno.list$file.type <- 3 pheno.list$delimiter <- "\t" pheno.list$id.var <- "id" # Define the variables in the model pheno.list$response.var <- "case.control" pheno.list$strata.var <- "ethnic.group" pheno.list$main.vars <- c("age.group", "oral.years", "n.children") pheno.list$int.vars <- "n.children" # Define the list of options op <- list(out.file="out.txt") # For this model, all variables are continuous # temp <- GxE.scan(snp.list, pheno.list, op=op)
# Define the list for the genotype data. snp.list <- list() snp.list$file <- system.file("sampleData", "geno_data.ldat.gz", package="CGEN") snp.list$file.type <- 7 snp.list$delimiter <- "\t" snp.list$in.miss <- " " # Only process the first 5 SNPs in the file snp.list$start.vec <- 1 snp.list$stop.vec <- 6 # Define pheno.list pheno.list <- list() pheno.list$file <- system.file("sampleData", "Xdata.txt", package="CGEN") pheno.list$file.type <- 3 pheno.list$delimiter <- "\t" pheno.list$id.var <- "id" # Define the variables in the model pheno.list$response.var <- "case.control" pheno.list$strata.var <- "ethnic.group" pheno.list$main.vars <- c("age.group", "oral.years", "n.children") pheno.list$int.vars <- "n.children" # Define the list of options op <- list(out.file="out.txt") # For this model, all variables are continuous # temp <- GxE.scan(snp.list, pheno.list, op=op)
Combines the output files from running GxE.scan on a cluster.
GxE.scan.combine(out.file, dir.or.files, pattern="GxEout_")
GxE.scan.combine(out.file, dir.or.files, pattern="GxEout_")
out.file |
Name of the output file of combined results. No default. |
dir.or.files |
Directory containing the output files from |
pattern |
Character string to match file names when |
This function would be called after all the submitted jobs from
GxE.scan.partition
have finished running.
It is assumed that the files to be combined all have a header of variable names.
A character vector of the files combined.
out.file <- "/data/results/all_output.txt" dir <- "/data/out" # GxE.scan.combine(out.file, dir)
out.file <- "/data/results/all_output.txt" dir <- "/data/out" # GxE.scan.combine(out.file, dir)
Creates job files for running GxE.scan on a parallel processing system.
GxE.scan.partition(snp.list, pheno.list, op=NULL)
GxE.scan.partition(snp.list, pheno.list, op=NULL)
snp.list |
See |
pheno.list |
See |
op |
See details for this list of options. The default is NULL. |
This function will create files needed for running a GWAS scan on a computing cluster.
The user must know how to submit jobs and know how to use their particular cluster.
On many clusters, the command for submitting a job is "qsub".
The scan is partitioned into smaller jobs by either setting the values for snp.list$start.vec
and
snp.list$stop.vec
or by setting the value for snp.list$include.snps
. The partitioning is done
so that each job will process an equal number of SNPs.
In the output directory (see option out.dir
), three types of files will be created. One type of file
will be the
R program file containing R statements defining snp.list
, pheno.list
and op
for the GxE.scan
function. These files have the ".R" file extension.
Another type of file will be the job file which calls the R program file. These files are named paste(op$out.dir, "job_", op$id.str, 1:op$n.jobs, sep="")
The third type of file is a single file containing the names of all the job files. This file has the prefix "Rjobs_".
This function will automatically set the name of the output file created by GxE.scan
to a file in the op$out.dir
directory with the prefix "GxEout_".
Options list op:
Below are the names for the options list op
. All names have default values
if they are not specified.
n.jobs
The (maximum) number of jobs to run.
The default is 100.
out.dir
Directory to save all files. If NULL, then the files will be
created in the working directory getwd
.
GxE.scan.op
List of options for the GxE.scan
function.
The default is NULL.
R.cmd
Character string for calling R.
The default is "R –vanilla".
begin.commands.R
Character vector of R statements to be placed at the top of each R program file.
For example, begin.commands.R
=c("rm(list=ls(all=TRUE))", "gc()", 'library(CGEN, lib.loc="/home/Rlibs/")')
The default is "library(CGEN)".
qsub.cmd
Character string for the command to submit a single job.
The default is "qsub".
begin.commands.qsub
Character vector of statements to be placed at the top of each job file.
For example, begin.commands.qsub
="module load R".
The default is NULL.
id.str
A character string to be appended to the file names.
The default is "".
snp.list
The objects start.vec
and stop.vec
in snp.list
are set automatically, so they do
not need to be set by the user.
In general, it is more efficient in terms of memory usage and speed to have the genotype data
partitioned into many files. Thus, snp.list$file
can not only be set to a single file but also set to
a character vector of the partitioned files when calling this function. In this case, the number of jobs
to create (op$n.jobs
) must be greater than or equal to the number of partitioned files.
An object in snp.list
that is unique to the GxE.scan.partition
function is
nsnps.vec
. Each element of snp.list$nsnps.vec
is the number of SNPs in each file of
snp.list$file
.
If nsnps.vec
is not specified and snp.list$file
contains more than one file,
then each job will process an entire file in snp.list$file
.
For the scenarios when the genotype data must be transformed and the data is contained in a single file, then
snp.list$include.snps
should also be set. This will create a separate list of SNPs for each job to
process.
The name of the file containing names of the job files to be submitted. See details.
# Define the list for the genotype data. There are 50 SNPs in the TPED file. snp.list <- list(nsnps.vec=50, format="tped") snp.list$file <- system.file("sampleData", "geno_data.tped.gz", package="CGEN") snp.list$subject.list <- system.file("sampleData", "geno_data.tfam", package="CGEN") # Define pheno.list pheno.list <- list(id.var=c("Family", "Subject"), delimiter="\t", header=1, response.var="CaseControl") pheno.list$file <- system.file("sampleData", "pheno.txt", package="CGEN") pheno.list$main.vars <- ~Gender + Exposure pheno.list$int.vars <- ~Exposure pheno.list$strata.var <- "Study" # Define the list of options. # Specifying n.jobs=5 will let each job process 10 SNPs. op <- list(n.jobs=5, GxE.scan.op=list(model=1)) # GxE.scan.partition(snp.list, pheno.list, op=op)
# Define the list for the genotype data. There are 50 SNPs in the TPED file. snp.list <- list(nsnps.vec=50, format="tped") snp.list$file <- system.file("sampleData", "geno_data.tped.gz", package="CGEN") snp.list$subject.list <- system.file("sampleData", "geno_data.tfam", package="CGEN") # Define pheno.list pheno.list <- list(id.var=c("Family", "Subject"), delimiter="\t", header=1, response.var="CaseControl") pheno.list$file <- system.file("sampleData", "pheno.txt", package="CGEN") pheno.list$main.vars <- ~Gender + Exposure pheno.list$int.vars <- ~Exposure pheno.list$strata.var <- "Study" # Define the list of options. # Specifying n.jobs=5 will let each job process 10 SNPs. op <- list(n.jobs=5, GxE.scan.op=list(model=1)) # GxE.scan.partition(snp.list, pheno.list, op=op)
The list to describe the locus map data for Manhattan.plot
.
The format is: List of 8
File containing the locus map data. This file must contain at least three columns: a column for the SNP names, a column for the chromosomes, and a column for the location of the SNP on the chromosome. The location should be numeric values. No default.
1, 3 or 4 (see details). The default is 3.
The delimiter used in the files. The default is "\t" (a tab).
0 or 1 if the file contains a header of variable names. The default is 0.
Variable name (e.g. rs number) or column number of the SNP (locus) variable. No default.
Variable name (e.g. chromosome number) or column number of the chromosome variable. No default.
Variable name or column number of the location variable, which denotes the SNP's position on the chromosome. This variable should be numeric. No default.
In this list, file
must be specified. The types of
files are described below.
Type 1
An .rda file where the saved object was a data frame.
Type 3
A delimited file.
Locus map data for Manhattan.plot
LocusMapData.txt is a tab delimited file that contains sample chromosome and location information. The first 5 rows look like:
SNP | CHROMOSOME | LOCATION |
rs11102647 | 1 | 113783261 |
rs6695241 | 1 | 172626514 |
rs12567796 | 1 | 18262009 |
rs2810583 | 1 | 41549436 |
# Load and print the first 5 rows data(LocusMapData, package="CGEN") LocusMapData[1:5, ]
# Load and print the first 5 rows data(LocusMapData, package="CGEN") LocusMapData[1:5, ]
Creates a Manhattan plot
Manhattan.plot(infile, plot.vars, locusMap.list, op=NULL)
Manhattan.plot(infile, plot.vars, locusMap.list, op=NULL)
infile |
Output file from |
plot.vars |
Character vector of the variables in |
locusMap.list |
See |
op |
List of options (see details). The default is NULL. |
Plots p-values on a minus log base 10 scale versus the locations of the SNPs on each chromosome.
Options list op:
Below are the names for the options list op
. All names have default values
if they are not specified.
add
A number to add spacing between the chromosomes. The default is 0.
cex
Magnification factor for the plotting symbol. The default is 1.
cex.axis
Magnification factor for the axes. The default is 1.
cex.lab
Magnification factor for the axes labels. The default is 1.
cex.main
Magnification factor for the title. The default is 1.
colors
Character vector of colors to use in the plot.
See colors
for all possible colors. The default is NULL.
figs
Two element vector for the number of rows and columns if length(plot.vars
) > 1
and onePlot
= 0. The default is NULL.
min.p
All p-values less than min.p
will be set to min.p
. The default is 1e-30.
onePlot
0 or 1 to plot all of plot.vars
on the same screen. The default is 0.
pch
Vector of plotting symbols to use. See points
for
the different plotting symbols.
splitScreen
0 or 1 to split the plot into two seperate parts.
The default is 0.
subset
Vector of chromosomes to plot. The default is NULL.
tcl
The length of tick marks as a fraction of the height of a line of text. The default is -0.5.
x.las
0-3 for axis labels. 0=parallel, 1=horizontal, 2=perpendicular, 3=vertical
The default is 2.
yaxis.range
Vector of length 2 to set the limits for the y-axis.
The limits should be on the original scale. The default is NULL.
Returns nothing. Creates a Manhattan plot.
# Load the data containing the chromosomes and locations data(LocusMapData, package="CGEN") # For illustrative purposes, add some hypothetical p-values to x # set.seed(123) LocusMapData[, "pvalue"] <- runif(nrow(LocusMapData)) # Define the input list locusMap.list locusList <- list(snp.var="SNP", chrm.var="CHROMOSOME", loc.var="LOCATION") # Create the plot Manhattan.plot(LocusMapData, "pvalue", locusList)
# Load the data containing the chromosomes and locations data(LocusMapData, package="CGEN") # For illustrative purposes, add some hypothetical p-values to x # set.seed(123) LocusMapData[, "pvalue"] <- runif(nrow(LocusMapData)) # Define the input list locusMap.list locusList <- list(snp.var="SNP", chrm.var="CHROMOSOME", loc.var="LOCATION") # Create the plot Manhattan.plot(LocusMapData, "pvalue", locusList)
The list to describe the covariate and outcome data for GxE.scan
.
The format is: List of 14
Covariate data file. This file must have variable names, two
of which being an id variable and a response variable (see
id.var
and response.var
).
No default.
Name of the id variable(s). No default.
Name of the binary response variable. This variable must be coded as 0 and 1. No default.
Stratification variable name or a formula for variables in file
.
See the individual model documentation for the allowable stratifications.
The default is NULL so that all observations
belong to the same strata.
Character vector of variables names or a formula for variables in file
that will be included in the model as main effects.
The default is NULL.
Character vector of variable names or a formula for variables in file
that will be included in the model as interactions with each SNP
in the genotype data.
The default is NULL.
1, 3, 4. 1 is for an R object file created with the
save()
function. 3 is for a table that will be read in
with read.table()
. 4 is for a SAS data set.
The default is 3.
The delimiter in file
.
The default is "".
Vector of variable names to convert into factors. The default is NULL.
Vector of character strings to define the missing values. This option
corresponds to the option na.strings
in read.table()
.
The default is "NA".
List of sublists to subset the phenotype data for analyses.
Each sublist should contain the names "var", "operator" and "value" corresponding
to a variable name, operator and values of the variable.
Multiple sublists are logically connected by the AND operator. For example,
subsetData=list(list(var="GENDER", operator="==", value="MALE"))
will only include subjects with the string "MALE" for the GENDER variable.
subsetData=list(list(var="AGE", operator=">", value=50),
list(var="STUDY", operator="%in%", value=c("A", "B", "C")))
will include subjects with AGE > 50 AND in STUDY A, B or C.
The default is NULL.
Name of the cc.var
variable used in snp.matched
.
The default is NULL.
Name of the nn.var
variable used in snp.matched
.
The default is NULL.
In this list, file
, id.var
, and response.var
must be specified.
The variable id.var
is the link between the covariate data and the genotype data.
For each subject id, there must be the same subject id in the genotype data for that subject
to be included in tha analysis. If the genotype data is in a PLINK format, then id.var
must be of length 2 corresponding the the family id and subject id.
Missing data: If any of the variables defined in main.vars
, int.vars
,
strata.var
, or response.var
contain missing values, then those subjects
will be removed from the covariate and outcome data. After the subjects with missing values
are removed, the subject ids are matched with the genotype data.
Prints an object returned from snp.logistic
or snp.matched
printEffects(obj, op=NULL)
printEffects(obj, op=NULL)
obj |
The return object from |
op |
Options list with names "digits" and "method" (see details). The default is NULL. |
Below are the names for the options list op
. All names have default values
if they are not specified.
digits
Integer: Number of significant digits to print. The default is 2.
method
Vector of values from "UML", "CML", "EB" or "CCL", "HCL", "CLR". The default is NULL.
Returns NULL
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification variable fit <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects effects <- snp.effects(fit, "oral.years", var.levels=c(0, 2, 3)) printEffects(effects)
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification variable fit <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects effects <- snp.effects(fit, "oral.years", var.levels=c(0, 2, 3)) printEffects(effects)
Create a quantile-quantile plot
QQ.plot(pvalues, op=NULL)
QQ.plot(pvalues, op=NULL)
pvalues |
Vector or matrix of p-values. No default. |
op |
List of options (see details). The default is NULL. |
Plots the ranked p-values against their expected order statistics on
a minus log base 10 scale. If pvalues
is a matrix, then multiple
qq plots will be produced, each on a seperate screen.
Options list op:
Below are the names for the options list op
. All names have default values
if they are not specified.
cex
Magnification factor for the plotting symbol. The default is 1.
cex.axis
Magnification factor for the axes. The default is 1.
cex.lab
Magnification factor for the axes labels. The default is 1.
cex.main
Magnification factor for the title. The default is 1.
color
The color of the plot and title. The default is "blue".
figs
Two element vector for the number of rows and columns if pvalues
is a matrix. The default is NULL.
min.p
All p-values less than min.p
will be set to min.p
. The default is 1e-16.
pch
The plotting symbol. The default is 21 (circle).
title
Character string or vector for the title of each plot.
If pvalues
is a matrix, then the column names will be used if title
is NULL.
The default is NULL.
xlim
The x-axis limits. The default is determined from pvalues
.
ylim
The y-axis limits. The default is determined from pvalues
.
Returns 0. Creates a QQ plot.
# set.seed(123) p <- runif(1000) QQ.plot(p)
# set.seed(123) p <- runif(1000) QQ.plot(p)
Computes joint and stratified effects of the SNP and another variable based on a fitted model.
snp.effects(fit, var, var.levels=c(0, 1), method=NULL)
snp.effects(fit, var, var.levels=c(0, 1), method=NULL)
fit |
Return object from |
var |
Name of the second variable to compute the effects for. This variable can be
a dummy variable, continuous variable, or a factor. Note that if this variable enters the model
as both a main effect and interaction, then it must enter the model the same way as a main effect
and interaction for the effects to be computed correctly.
For example, if |
var.levels |
(For continuous |
method |
Vector of values from "UML", "CML", "EB" or "CCL", "HCL", "CLR". The default is NULL. |
The joint and stratified effects are computed for each method in fit
.
The stratified effects are the sub-group effect of the SNP stratified by
var
and the sub-group effect of var
stratified by the SNP.
Definition of joint and stratified effects:
Consider the model:
Let 0 be the baseline for SNP and the baseline for X. Then the joint effect
for SNP = s and X = x relative to SNP = 0 and X =
is
The stratified effect of the SNP relative to SNP = 0 given X = x is
The stratified effect of var
relative to X = x given SNP = s is
A convenient way to print the returned object to view the effects tables is with the function printEffects
.
If fit
is of class snp.logistic
, then the return object is a list of with names "UML", "CML", and "EB".
If fit
is of class snp.matched
, then the return object is a list of with names "CLR", "CCL", and "HCL".
Each sublist contains joint effects, stratified effects, standard errors and confidence intervals.
The sub-group effect of the SNP stratified by var
is in the list "StratEffects", and the
sub-group effect of var
stratified by the SNP is in the list "StratEffects.2".
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification variable fit <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects effects <- snp.effects(fit, "oral.years", var.levels=0:5)
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification variable fit <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects effects <- snp.effects(fit, "oral.years", var.levels=0:5)
Creates a plot of the effects returned from snp.effects
snp.effects.plot(obj.list, op=NULL)
snp.effects.plot(obj.list, op=NULL)
obj.list |
Return object or list of return objects from |
op |
List of options (see details). The default is NULL. |
Plots the effects returned from snp.effects
. By default, the effects in StratEffects
for each method will be plotted. The side of the effect will have a sawtooth edge if the effect goes beyond
the limits of the plot.
Options list op:
Below are the names for the options list op
. All names have default values
if they are not specified.
method
Character vector of the values "UML", "CML", "EB", "HCL", "CCL", "CLR".
The default is all methods will be plotted.
type
One of "JointEffects", "StratEffects", "StratEffects.2".
The default is StratEffects.
ylim
NULL or a 2-element numeric vector specifying the y-axis limits for all plots.
If not specified, different plots will be on different scales.
The default is NULL.
legend
See legend
.
Set to NA for no legend to appear.
The default is NULL.
split.screen
NULL or a 2-element vector for partitioning the plot window.
This option is only valid for inputing a list of objects.
The default is NULL.
colors
Character vector of colors to use in the plot.
See colors
for all possible colors. The default is NULL.
levels1
Vector of levels for the SNP variable to plot. When plotting more than one method,
levels1
has the default value of 1. Otherwise, the default is NULL.
levels2
Vector of levels to plot for the variable var
(in snp.effects
).
The default is NULL.
addCI
0 or 1 to add 95% confidence intervals to the plot. The confidence intervals appear
as un-filled boxes around each odds-ratio.
The default is 0.
Returns nothing. Creates a snp effect plot.
# Use the ovarian cancer data data(Xdata, package="CGEN") # Add some fake SNPs # set.seed(636) Xdata[, "rs123"] <- rbinom(nrow(Xdata), 1, 0.4) Xdata[, "rs456"] <- rbinom(nrow(Xdata), 1, 0.4) Xdata[, "rs789"] <- rbinom(nrow(Xdata), 1, 0.4) snpVars <- c("BRCA.status", "rs123", "rs456", "rs789") objects <- list() for (i in 1:length(snpVars)) { fit <- snp.logistic(Xdata, "case.control", snpVars[i], main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects objects[[i]] <- snp.effects(fit, "oral.years", var.levels=0:4) } # Plot snp.effects.plot(objects) # Plot all on the same scale #snp.effects.plot(objects, op=list(ylim=c(0.9, 1.4), legend=list(x="bottom"))) # Plot all the joint effects of rs789 for the CML method and add confidence intervals #snp.effects.plot(objects[[4]], op=list(method="CML", type="JointEffects", # legend=list(x="bottomleft", inset=0), ylim=c(0.45, 1.3), # colors=c("blue", "aquamarine", "skyblue"), addCI=1))
# Use the ovarian cancer data data(Xdata, package="CGEN") # Add some fake SNPs # set.seed(636) Xdata[, "rs123"] <- rbinom(nrow(Xdata), 1, 0.4) Xdata[, "rs456"] <- rbinom(nrow(Xdata), 1, 0.4) Xdata[, "rs789"] <- rbinom(nrow(Xdata), 1, 0.4) snpVars <- c("BRCA.status", "rs123", "rs456", "rs789") objects <- list() for (i in 1:length(snpVars)) { fit <- snp.logistic(Xdata, "case.control", snpVars[i], main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var="ethnic.group") # Compute the effects objects[[i]] <- snp.effects(fit, "oral.years", var.levels=0:4) } # Plot snp.effects.plot(objects) # Plot all on the same scale #snp.effects.plot(objects, op=list(ylim=c(0.9, 1.4), legend=list(x="bottom"))) # Plot all the joint effects of rs789 for the CML method and add confidence intervals #snp.effects.plot(objects[[4]], op=list(method="CML", type="JointEffects", # legend=list(x="bottomleft", inset=0), ylim=c(0.45, 1.3), # colors=c("blue", "aquamarine", "skyblue"), addCI=1))
The list to describe the genotype data for GxE.scan
.
The format is: List of 14
File to use. No default.
Genotype format (see details).
Typical values are "bed", "tped", "impute", "ldat", "lbat", etc.
The default is determined from the file
extension.
List to describe the subject ids stored in a file. This list is only needed
when the genotype file does not contain the subject ids (for example with PLINK files).
The order of the subject ids is subject.list$file
must match the order in the genotype data. See subject.list
.
The default is NULL.
Starting row of file
to begin processing SNPs. The default is 1.
Last row of file
to finish processing SNPs.
Use any value < 1 so that all SNPs from rows start.vec
to the end of the file
will be analyzed.
The default is -1.
The delimiter used in file
. The default is determined from the file
format.
Vector of values to denote the missing values in file
.
The default is determined from the file
format.
Vector of codes used for the heterozygous genotype. If NULL, then it is assumed that the heterozygous genotype is of the form "AB", "Aa", "CT", ... etc, ie a 2-character string with different characters (case sensitive). The default is NULL.
Options only used with GxE.scan
:
File, list or character vector to define which SNPs should be included in the analysis.
If a file, then the file should contain a single column of SNP ids to include.
More generally, if the SNPs to be included are in a file with multiple columns, then
include.snps
can be a list of type subject.list
.
If it is a character vector, then it should be a vector of SNP ids.
This option can also be used with the options start.vec
and stop.vec
(see details).
The default is NULL.
Command for running the PLINK software to transform certain file
formats (see details).
Set PLINK to "" if PLINK is not available or if you do not want PLINK to be used.
The PLINK software can be found at http://pngu.mgh.harvard.edu/~purcell/plink/.
The default is "plink".
Command for running the GLU software to transform certain file
formats (see details).
Set GLU to "" if GLU is not available or if you do not want GLU to be used.
The GLU software can be found at http://code.google.com/p/glu-genetics/.
The default is "glu".
In this list, file
must be specified, and format
should
be specified. If not, then the program will attempt to guess the correct format
of the genotype data from the file extension of file
.
If format
is a format that GxE.scan
is not set up to read directly (such as
"bed", "lbat", "ped"), then either PLINK or GLU
will be called to transform the data into either a "tped" or "ldat" format.
When the option include.snps
is specified as a file, then the options start.vec
and
stop.vec
will be applied to the SNPs in this file. For example, suppose we have the genotype file
snps.bed which is the PLINK "bed" format. We can set include.snps
to the corresponding ".bim" file:
include.snps <- list(file="snps.bim", id.var=2, header=0, delimiter="\t").
Then the included SNPs in the analysis will be
the SNPs in rows start.vec
to stop.vec
of file "snps.bim".
Other options such as delimiter
and in.miss
do not need to be specified, because they can
be determined from the genotype data format.
If the SNPs are coded in the standard (0,1,2) coding, then
set heter.codes
to 1 (the heterozygous genotype).
# Example snp.list for a PLINK binary pedigree file when using GxE.scan ## Not run: pathToPLINK <- "c:/PLINK/plink-1.07-dos/plink.exe" snp.file <- "c:/data/project1/lungCancer.bed" subject.list <- "c:/data/project1/lungCancer.fam" snp.list <- list(file=snp.file, format="bed", PLINK=pathToPLINK, subject.list=subject.list) ## End(Not run) # Suppose the genotype data is an output genotype file from th IMPUTE2 software # The below list is for processing the file. ## Not run: snp.list <- list(file="C:/temp/data/chr11_1.imputed.txt.gz", delimiter=" ", format="impute") ## End(Not run)
# Example snp.list for a PLINK binary pedigree file when using GxE.scan ## Not run: pathToPLINK <- "c:/PLINK/plink-1.07-dos/plink.exe" snp.file <- "c:/data/project1/lungCancer.bed" subject.list <- "c:/data/project1/lungCancer.fam" snp.list <- list(file=snp.file, format="bed", PLINK=pathToPLINK, subject.list=subject.list) ## End(Not run) # Suppose the genotype data is an output genotype file from th IMPUTE2 software # The below list is for processing the file. ## Not run: snp.list <- list(file="C:/temp/data/chr11_1.imputed.txt.gz", delimiter=" ", format="impute") ## End(Not run)
Performs logistic regression including a particular SNP (G) and a set of covariates (X) that could include environmental covariates or/and other genetic variables. Included are three analysis options: (i) Unconstrained maximum-likelihood: This method is equivalent to prospective logistic regression analysis and corresponds to maximum-likelihood analysis of case-control data allowing the joint distribution of all the factors (the SNP of interest and all other covariates) of the model to be completely unrestricted (non-parametric) (ii) Constrained maximum-likelihood: This method performs maximum-likelihood analysis of case-control data under the assumption of HWE and indepenence between the SNP and other factors of the model.The analysis allows the assumptions of HWE and independence to be valid only conditional on certain stratification variables (S), such as self reported ethnicity or principal compoenets of population stratification. (iii) Empirical-Bayes: This method uses an empirical-Bayes type "shrinkage estimation" technique to trade-off bias and variance between the constrained and unconstrained maximum-likelihood estimators.
snp.logistic(data, response.var, snp.var, main.vars=NULL, int.vars=NULL, strata.var=NULL, op=NULL, additive.trend=FALSE)
snp.logistic(data, response.var, snp.var, main.vars=NULL, int.vars=NULL, strata.var=NULL, op=NULL, additive.trend=FALSE)
data |
Data frame containing all the data. No default. |
response.var |
Name of the binary response variable coded as 0 (controls) and 1 (cases). No default. |
snp.var |
Name of the SNP variable, which must be coded 0-1-2 (or 0-1). The SNP will be included as a main effect in the model. No default. |
main.vars |
Character vector of variable names or a formula for all covariates of interest which need to be included in the model as main effects. The default is NULL, so that only the SNP variable will be included as a main effect in the model. |
int.vars |
Character vector of variable names or a formula for all covariates of interest that will interact with the SNP variable. The default is NULL, so that no interactions will be in the model. |
strata.var |
Name of the stratification variable or a formula (see details for more info).
If |
op |
A list with names |
additive.trend |
Logical, FALSE by default. No need to set. It is used by additive.test() function for additive trend implementation. |
Note: Non-dummy continuous variables should be scaled for stability of the algorithm.
The scale
function can be used for this.
The data is first fit using standard logistic regression. The estimated
parameters from the standard logistic regression are then used as the initial
estimates for the constrained model. For this,
the optim()
function is used to compute the maximum likelihood estimates and
the estimated covariance matrix. The empirical Bayes estimates are then computed by
combining both sets of estimated parameters (see below). The "strata" option, that is relevent for the
CML and EB method, allows
the assumption of HWE and G-X independence to be valid only conditional on a given set of other factors.
If a single categorical variable name is provided, then the unique levels of the variable will be used to define categorical strata.
Otherwise it is assumed that strata.var
defines a parametric model for variation of allele
frequency of the SNP as a function of the variables included. No assumption
is made about the relationship between X and S. Typically, S would include self reported ethnicity,
study, center/geographic region and principal components of population stratification. The CML method with the "strata"
defined by principal compoenents of population stratification can be viewed as a generalization of adjusted case-only method
described in Bhattacharjee et al. (2010). More details of the individual methods follow.
Definition of the likelihood under the gene-environment independence assumption:
Let D = 0, 1 be the case-control status, G = 0, 1, 2 denote the SNP genotype, S denote the stratification variable(s) and X denote the set of all other factors to be included in the regression model. Suppose the risk of the disease (D), given G, X and S can be described by a logistic regression model of the form
where Z is the entire design matrix (including G, X, possibly S and their interaction with X) and is the vector of associated regression coefficients.
The CML method assumes Pr(G|X,S)=Pr(G|S), i.e., G and X are conditionally independent given S. The current implementation of the CML method also assume the SNP genotype frequency follows
HWE given S=s, although this is not necessary in general. Thus, if
denotes the allele frequency given S=s, then
If , then
and
Chatterjee and Carroll (2005) showed that under the above constraints, the maximum-likelihood estimate for the coefficients under case-control design can be obtained based on a simple conditional likelihood of the form
where the sum is taken over the 6 combinations of d and g and
If S is a single categorical variable, then a separate
is allowed for each S=s. Otherwise it is assumed
, where
is the design matrix associated with the stratification and
is the vector of stratification parameters.
If for example, S is specified as "strata=~PC1+PC2+...PCK" where PCk's denote principal components of population stratification, then it is assumed that the allele frequency of the SNP varies in directions of the different principal components in a logistic linear fashion.
Definition of the empirical bayes estimates:
Let be the parameter estimates from standard logistic regression, and
let
be the estimates under the gene-environment independence assumption.
Let
, and
be the vector of variances of
.
Define diagonal matrices of weights to be
and
,
where
is the elementwise product of the vector
.
Now, the empirical bayes parameter estimates are
For the estimated covariance matrix, define the diagonal matrix
where again the exponentiation is the elementwise product of the vectors.
If is the pxp identity matrix and we define the px2p matrix
, then
the estimated covariance matrix is
The covariance term is obtained using
an influence function method (see Chen YH, Chatterjee N, and Carroll R. for details about
the above formulation of the empirical-Bayes method).
Options list:
Below are the names for the options list op
. All names have default values
if they are not specified.
genetic.model
0-3: The genetic model for the SNP. 0=additive, 1=dominant,
2=recessive, 3=general (co-dominant).
reltol
Stopping tolerance. The default is 1e-8.
maxiter
Maximum number of iterations. The default is 100.
optimizer
One of "BFGS", "CG", "L-BFGS-B", "Nelder-Mead", "SANN".
The default is "BFGS".
A list containing sublists with names UML (unconstrained maximum likelihood), CML (constrained maximum likelihood), and EB (empirical Bayes). Each sublist contains the parameter estimates (parms) and covariance matrix (cov). The lists UML and CML also contain the log-likelihood (loglike). The list CML also contains the results for the stratum specific allele frequencies under the HWE assumption (strata.parms and strata.cov). The EB sublist contains the joint UML-CML covariance matrix.
Mukherjee B, Chatterjee N. Exploiting gene-environment independence in analysis of case-control studies: An empirical Bayes approach to trade-off between bias and efficiency. Biometrics 2008, 64(3):685-94.
Mukherjee B et al. Tests for gene-environment interaction from case-control data: a novel study of type I error, power and designs. Genetic Epidemiology, 2008, 32:615-26.
Chatterjee, N. and Carroll, R. Semiparametric maximum likelihood estimation exploting gene-environment independence in case-control studies. Biometrika, 2005, 92, 2, pp.399-418.
Chen YH, Chatterjee N, Carroll R. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. Journal of the American Statistical Association, 2009, 104: 220-233.
Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, Chatterjee N
Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies.
American Journal of Human Genetics, 2010, 86(3):331-342.
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification (categorical) variable ret <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var=~factor(ethnic.group)) # Compute a summary table for the models getSummary(ret) # Compute a Wald test for the main effect of the SNP and interaction getWaldTest(ret, c("BRCA.status", "BRCA.status:oral.years", "BRCA.status:n.children")) # Fit the same model as above using formulas ret2 <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=~oral.years + n.children, int.vars=~oral.years + n.children, strata.var=~factor(ethnic.group))
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fit using a stratification (categorical) variable ret <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), strata.var=~factor(ethnic.group)) # Compute a summary table for the models getSummary(ret) # Compute a Wald test for the main effect of the SNP and interaction getWaldTest(ret, c("BRCA.status", "BRCA.status:oral.years", "BRCA.status:n.children")) # Fit the same model as above using formulas ret2 <- snp.logistic(Xdata, "case.control", "BRCA.status", main.vars=~oral.years + n.children, int.vars=~oral.years + n.children, strata.var=~factor(ethnic.group))
Performs a conditional likelihood-based analysis of matched case-control data typically modeling a particular SNP and a set of covariates that could include environmental covariates or/and other genetic variables. Three alternative analysis options are included: (i) Conditional Logistic Regression (CLR): This method is classical CLR that does not try to utilize G-G or G-E independence allowing the joint distribution of the covariates in the model to be completely unrestricted (non-parametric) (ii) Constrained Conditional Logistic (CCL) : This method performs CLR analysis of case-control data under the assumption of gene-environment (or/and gene-gene) independence not in the entire population but within finely matched case-control sets. (iii) Hybrid Conditional Logistic (HCL): This method is suitable if nearest neighbor matching (see the reference by Bhattacharjee et al. 2010) is performed without regard to case-control status. The likelihood (like CCL) assumes G-G/G-E independence within matched sets but in addition borrows some information across matched sets by using a parametric model to account for heterogeneity in disease across strata.
snp.matched(data, response.var, snp.vars, main.vars=NULL, int.vars=NULL, cc.var=NULL, nn.var=NULL, op=NULL)
snp.matched(data, response.var, snp.vars, main.vars=NULL, int.vars=NULL, cc.var=NULL, nn.var=NULL, op=NULL)
data |
Data frame containing all the data. No default. |
response.var |
Name of the binary response variable coded as 0 (controls) and 1 (cases). No default. |
snp.vars |
A vector of variable names or a formula, generally coding a single SNP variable (see details). No default. |
main.vars |
Vector of variable names or a formula for all covariates of interest
which need to be included in the model as main effects. The default is NULL, so that only the |
int.vars |
Character vector of variable names or a formula for all covariates of interest that will interact with the SNP variable. The default is NULL, so that no interactions will be in the model. |
cc.var |
Integer matching variable with at most 10 subjects per stratum (e.g. CC matching using |
nn.var |
Integer matching variable with at most 8 subjects per stratum (e.g. NN matching using |
op |
Control options for Newton-Raphson optimizer. List containing members "maxiter" (default 100) and "reltol" (default 1e-5). |
To compute HCL, the data is first fit using standard logistic regression. The estimated parameters from the standard logistic regression are then
used as the initial estimates for Newton-Raphson iterations with exact gradient and hessian. Similarly for CCL, the data is first fit using
clogit
using cc.var
to obtain the CLR estimate as an intial estimate and Newton-Raphson is used to maximize
the likelihood.
While snp.logistic
parametrically models the SNP variable, this function is non-parametric and hence offers somewhat
more flexibility. The only constraint on snp.vars
is that it is independent of int.vars
within homogenous matched sets. It can be any
genetic or non-genetic variable or a collection of those. For example 3 SNPs coded as general, dominant and additive can be specified through a single
formula e.g., "snp.vars= ~ (SNP1==1) + (SNP1 == 2) + (SNP2 >= 1)+ SNP3." However, when multiple variables are used in snp.vars
results should be interpreted carefully.
Summary function snp.effects
can only be applied if a single SNP variable is coded.
Note that int.vars
consists of variables that interact with the SNP variable and
can be assumed to be independent of snp.vars
within matched sets. Those interactions for which independence is
not assumed can be included in main.vars
(as product of appropriate variables).
Both CCL and HCL provide considerable gain in power compared to standard CLR. CCL derives more power by generating
pseudo-controls under the assumption of G-G/G-E independence within matched case-control sets. HCL makes the same assumption but allows each matched set to
have any number of cases and controls unlike classical case-control matching. By comparing across matched sets, it is able to estimate the intercept parameter and
improve efficiency of estimating main effects compared to CLR and CCL. At the same time behaves similar to CCL for interactions by assuming
G-G/G-E independence only within mathced sets. For both these methods, the power increase for interaction depends on sizes of the matched sets
in nn.var
, which is currently limited to 8, to avaoid both memory and speed issues.
The authors would like to acknowledge Bijit Kumar Roy for his help in designing the internal data structure and algorithm for HCL/CCL likelihood computations.
A list containing sublists with names CLR, CCL, and HCL. Each sublist contains the parameter estimates (parms), covariance matrix (cov), and log-likelihood (loglike).
Chatterjee N, Zeynep K and Carroll R. Exploiting gene-environment independence in family-based case-control studies:
Increased power for detecting associations, interactions and joint-effects. Genetic Epidemiology 2005; 28:138-156.
Bhattacharjee S., Wang Z., Ciampa J., Kraft P., Chanock S, Yu K., Chatterjee N.
Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only studies.
American Journal of Human Genetics 2010, 86(3):331-342.
Breslow, NE. and Day, NE. Conditional Logistic Regression for Matched Sets. In "Statistical methods in cancer research. Volume I - The analysis of case-control studies." 1980, Lyon: IARC Sci Publ;(32):247-279.
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fake principal component columns # set.seed(123) Ydata <- cbind(Xdata, PC1=rnorm(nrow(Xdata)), PC2=rnorm(nrow(Xdata))) # Match using PC1 and PC2 mx <- getMatchedSets(Ydata, CC=TRUE, NN=TRUE, ccs.var="case.control", dist.vars=c("PC1","PC2"), size = 4) # Append columns for CC and NN matching to the data Zdata <- cbind(Ydata, CCStrat=mx$CC, NNStrat=mx$NN) # Fit using variable names ret1 <- snp.matched(Zdata, "case.control", snp.vars = "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), cc.var="CCStrat", nn.var="NNStrat") # Compute a Wald test for the main effect of BRCA.status and its interactions getWaldTest(ret1, c("BRCA.status", "BRCA.status:oral.years", "BRCA.status:n.children")) # Fit the same model as above using formulas. ret2 <- snp.matched(Zdata, "case.control", snp.vars = ~ BRCA.status, main.vars=~oral.years + n.children, int.vars=~oral.years + n.children, cc.var="CCStrat",nn.var="NNStrat") # Compute a summary table for the models getSummary(ret2)
# Use the ovarian cancer data data(Xdata, package="CGEN") # Fake principal component columns # set.seed(123) Ydata <- cbind(Xdata, PC1=rnorm(nrow(Xdata)), PC2=rnorm(nrow(Xdata))) # Match using PC1 and PC2 mx <- getMatchedSets(Ydata, CC=TRUE, NN=TRUE, ccs.var="case.control", dist.vars=c("PC1","PC2"), size = 4) # Append columns for CC and NN matching to the data Zdata <- cbind(Ydata, CCStrat=mx$CC, NNStrat=mx$NN) # Fit using variable names ret1 <- snp.matched(Zdata, "case.control", snp.vars = "BRCA.status", main.vars=c("oral.years", "n.children"), int.vars=c("oral.years", "n.children"), cc.var="CCStrat", nn.var="NNStrat") # Compute a Wald test for the main effect of BRCA.status and its interactions getWaldTest(ret1, c("BRCA.status", "BRCA.status:oral.years", "BRCA.status:n.children")) # Fit the same model as above using formulas. ret2 <- snp.matched(Zdata, "case.control", snp.vars = ~ BRCA.status, main.vars=~oral.years + n.children, int.vars=~oral.years + n.children, cc.var="CCStrat",nn.var="NNStrat") # Compute a summary table for the models getSummary(ret2)
Score tests for genetic association incorporating gene-environment interaction. The function implements two types of score-tests: (1) MScore: A test based on maximum of a class of exposure-weighted score test (Han et al, Biomentrics, 2015) (2) JScore: Joint score-test for genetic association and interaction based on standard logistic model (Song et al., In Prep).
snp.score(data, response.var, snp.var, exposure.var, main.vars, strata.var=NULL, op=NULL)
snp.score(data, response.var, snp.var, exposure.var, main.vars, strata.var=NULL, op=NULL)
data |
Data frame containing all the data. No default. |
response.var |
Name of the binary response variable coded as 0 (controls) and 1 (cases). No default. |
snp.var |
Name of the genotype variable. No default. |
exposure.var |
Character vector of variable names or a formula for the exposure variables. No default. |
main.vars |
Character vector of variable names or a formula for all covariates of interest
which need to be included in the model as main effects.
This argument can be NULL for |
strata.var |
Name of the stratification variable for a retrospective likelihood.
The option allows the genotype frequency to vary by the discrete level of the stratification
variable. Ethnic or geographic origin of subjects, for example, could be used to define
strata. Unlike |
op |
A list of options (see details). The default is NULL. |
The MScore option performs a score test for detecting an association between a SNP and disease risk, encompassing a broad range of risk models, including logistic, probit, and additive models for specifying joint effects of genetic and environmental exposures. The test statistics are obtained my maximizing over a class of score tests, each of which involves modified standard tests of genetic association through a weight function. This weight function reflects the potential heterogeneity of the genetic effects by levels of environmental exposures under a particular model. The MScore test could be performed using either a retrospective or prospective likelihood depending on whether an assumption of gene-environment independence is imposed or not, respectively. The JScore function performs joint score-test for genetic association and gene-environment interaction under a standard logistic regression model. The JScore test could be performed under a prospective likelihood that allows association between gene and environment to remain unrestricted, a retrospective likelihood that assumes gene-environment independence and an empirical-Bayes framework that allows data adaptive shrinkage between retrospective and prospective score-tests. The JScore function is explicitly developed for proper analysis of imputed SNPs under all of the different options. The MScore function should produce valid tests for imputed SNPs under prospective likelihood. But further studies are needed for this method for analysis of imputed SNPs with the retrospective likelihood. The JScore function returns one-step MLE for parameters which can be used to perform meta-analysis across studies using standard techniques.
Options list:
Below are the names for the options list op
.
method
1 or 2 for the test. 1 = MScore, 2 = JScore.
The default is 2.
Options for method 1:
thetas
Numeric vector of values in which the test statistic will be calculated over to find the maximum.
Theta values correspond to different risk models, which can take any value of real numbers.
For example, theta = -1 corresponds to an additive model, 0 to a multiplicative model,
and a probit model is between -1 < theta < 0. Supra-multiplicative model corresponds to theta > 1.
The default is seq(-3, 3, 0.1)
indep
TRUE or FALSE for the gene-environment independence assumption.
The default is FALSE.
doGLM
TRUE or FALSE for calculating the Wald p-value for the SNP main effect. The default is FALSE.
Options for method 2:
sandwich
TRUE or FALSE to return tests and p-values based on a sandwich covariance matrix.
The default is FALSE.
For op$method
= 1 (MScore), the returned object is a list with the following components:
maxTheta
Value of thetas
where the maximum score test occurs.
maxScore
Maximum value of the score test.
pval
P-value of the score test.
pval.logit
P-value of the standard association test based on logistic regression.
model.info
List of information from the model.
For op$method
= 2 (JScore), the returned object is
a list containing test statistics, p-values and one-step MLEs for the parameters and
variance-covariance matrices for the UML, CML and EB methods. Any name in the list
containing the string "test", "pval", "parm" or "var" is a test statistic, p-value,
parameter estimate or variance-covariance matrix respectively.
Han, S.S., Rosenberg, P., Ghosh, A., Landi M.T., Caporaso N. and Chatterjee, N. An exposure weighted score test for genetic association integrating environmental risk-factors. Biometrics 2015 (Article first published online: 1 JUL 2015 | DOI: 10.1111/biom.12328)
Song M., Wheeler B., Chatterjee, N. Using imputed genotype data in joint score tests for genetic association and gene-environment interactions in case-control studies (In preparation).
# Use the ovarian cancer data data(Xdata, package="CGEN") table(Xdata[, "gynSurgery.history"]) # Recode the exposure variable so that it is 0-1 temp <- Xdata[, "gynSurgery.history"] == 2 Xdata[temp, "gynSurgery.history"] <- 1 out <- snp.score(Xdata, "case.control", "BRCA.status", "gynSurgery.history", main.vars=c("n.children","oral.years"), op=list(method=2))
# Use the ovarian cancer data data(Xdata, package="CGEN") table(Xdata[, "gynSurgery.history"]) # Recode the exposure variable so that it is 0-1 temp <- Xdata[, "gynSurgery.history"] == 2 Xdata[temp, "gynSurgery.history"] <- 1 out <- snp.score(Xdata, "case.control", "BRCA.status", "gynSurgery.history", main.vars=c("n.children","oral.years"), op=list(method=2))
The list to describe the file of subject ids for snp.list
The format is: List of 4
Text file containing the subject ids. The file can be a single column of ids, or a delimited file of several columns with the ids as one of the columns. No default.
Column number(s) or variable name(s) containing the subject ids.
Use id.var
=-1 if the file is a single column of ids.
The delimiter in file
.
The default is "".
0 or 1 if the file contains a header of variable names.
This list is should only be used when the genotype file does not contain subject ids.
The order of the ids in this file must match the order of the genotypes in the genotype file.
If the genotype data is in a PLINK format, then id.var
must be of length 2 corresponding the the family id and subject id.
When using the GxE.scan
function, this list can often just be set to the name of the file
containing the ids.
Sample covariate and outcome data
The data is taken from an ovarian cancer study. The file Xdata.txt is a
tab-delimited type 3 data set (see file.type
in pheno.list
).
It contains the variables:
id
The subject id
case.control
Ovarian cancer status (0, 1)
BRCA.status
Simulated data for breast cancer status (0, 1)
oral.years
Years of oral contraceptive use
n.children
Number of children
age.group
Age group in 5 categories (1-5)
ethnic.group
Ethnic group in 3 categories (1-3)
BRCA.history
Personal history of breast cancer (0,1)
gynSurgery.history
History of gynechological surgery (0, 1, 2)
family.history
Family history of breast/ovarian cancer (0, 1, 2)
# Load and print the first 5 rows data(Xdata, package="CGEN") Xdata[1:5, ]
# Load and print the first 5 rows data(Xdata, package="CGEN") Xdata[1:5, ]
Sample covariate and outcome data
The data is taken from a lung cancer study. The file Xdata2.txt is a
tab-delimited type 3 data set (see file.type
in pheno.list
).
It contains the variables:
case.control
Lung cancer status (0, 1)
SNP
Simulated SNP data
smoking
Smoking status, 0: never smokers 1: ever smokers
study
Different studies from which subjects are taken: 1, 2, 3, 4, 5
cov1
A covariate taking integer values
cov2
A covariate taking binary values 0, 1
cov3
A covariate taking integer values
cov4
A covariate taking continuous values
# Load and print the first 5 rows data(Xdata2, package="CGEN") Xdata2[1:5, ]
# Load and print the first 5 rows data(Xdata2, package="CGEN") Xdata2[1:5, ]