This vignette provides a description of how to use the GENESIS package to run genetic association tests on array (SNP) data. GENESIS uses mixed models for genetic association testing, as PC-AiR PCs can be used as fixed effect covariates to adjust for population stratification, and a kinship matrix (or genetic relationship matrix) estimated from PC-Relate can be used to account for phenotype correlation due to genetic similarity among samples.
The fitNullModel
function in the GENESIS
package reads sample data from either a standard data.frame
class object or a ScanAnnotationDataFrame
class object as
created by the GWASTools
package. This object must contain all of the outcome and covariate data
for all samples to be included in the mixed model analysis.
Additionally, this object must include a variable called “scanID” which
contains a unique identifier for each sample in the analysis. While a
standard data.frame
can be used, we recommend using a
ScanAnnotationDataFrame
object, as it can be paired with
the genotype data (see below) to ensure matching of sample phenotype and
genotype data. Through the use of GWASTools,
a ScanAnnotationDataFrame
class object can easily be
created from a data.frame
class object. Example R code for
creating a ScanAnnotationDataFrame
object is presented
below. Much more detail can be found in the GWASTools
package reference manual.
# mypcair contains PCs from a previous PC-AiR analysis
# pheno is a vector of Phenotype values
# make a data.frame
mydat <- data.frame(scanID = mypcair$sample.id, pc1 = mypcair$vectors[,1],
pheno = pheno)
head(mydat)
## scanID pc1 pheno
## NA19919 NA19919 -0.12511091 0.1917327
## NA19916 NA19916 -0.13151738 -0.5687960
## NA19835 NA19835 -0.08832100 0.8734804
## NA20282 NA20282 -0.08617667 0.5787452
## NA19703 NA19703 -0.11969449 1.6116791
## NA19902 NA19902 -0.11458896 0.6663577
## An object of class 'ScanAnnotationDataFrame'
## scans: NA19919 NA19916 ... NA19764 (173 total)
## varLabels: scanID pc1 pheno
## varMetadata: labelDescription
The assocTestSingle
function in the GENESIS
package reads genotype data from a GenotypeData
class
object as created by the GWASTools
package. Through the use of GWASTools,
a GenotypeData
class object can easily be created from:
Example R code for creating a GenotypeData
object is
presented below. Much more detail can be found in the GWASTools
package reference manual.
geno <- MatrixGenotypeReader(genotype = genotype, snpID = snpID,
chromosome = chromosome, position = position,
scanID = scanID)
genoData <- GenotypeData(geno)
genotype
is a matrix of genotype values coded as 0 / 1
/ 2, where rows index SNPs and columns index samplessnpID
is an integer vector of unique SNP IDschromosome
is an integer vector specifying the
chromosome of each SNPposition
is an integer vector specifying the position
of each SNPscanID
is a vector of unique individual IDsfilename
is the file path to the GDS objectThe SNPRelate
package provides the snpgdsBED2GDS
function to convert
binary PLINK files into a GDS file.
snpgdsBED2GDS(bed.fn = "genotype.bed",
bim.fn = "genotype.bim",
fam.fn = "genotype.fam",
out.gdsfn = "genotype.gds")
bed.fn
is the file path to the PLINK .bed filebim.fn
is the file path to the PLINK .bim filefam.fn
is the file path to the PLINK .fam fileout.gdsfn
is the file path for the output GDS fileOnce the PLINK files have been converted to a GDS file, then a
GenotypeData
object can be created as described above.
To demonstrate association testing with the GENESIS package, we analyze SNP data from the Mexican Americans in Los Angeles, California (MXL) and African American individuals in the southwestern USA (ASW) population samples of HapMap 3. Mexican Americans and African Americans have a diverse ancestral background, and familial relatives are present in these data. Genotype data at a subset of 20K autosomal SNPs for 173 individuals are provided as a GDS file.
# read in GDS data
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object with paired ScanAnnotationDataFrame
HapMap_genoData <- GenotypeData(HapMap_geno, scanAnnot = scanAnnot)
HapMap_genoData
## An object of class GenotypeData
## | data:
## File: /tmp/RtmpYFWg76/Rinst2237f821f9d/GENESIS/extdata/HapMap_ASW_MXL_geno.gds (901.8K)
## + [ ] *
## |--+ sample.id { Int32,factor 173 ZIP(40.9%), 283B } *
## |--+ snp.id { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.position { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.chromosome { Int32 20000 ZIP(0.13%), 103B }
## \--+ genotype { Bit2 20000x173, 844.7K } *
## | SNP Annotation:
## NULL
## | Scan Annotation:
## An object of class 'ScanAnnotationDataFrame'
## scans: NA19919 NA19916 ... NA19764 (173 total)
## varLabels: scanID pc1 pheno
## varMetadata: labelDescription
A mixed model for genetic association testing typically includes a
genetic relationship matrix (GRM) to account for genetic similarity
among sample individuals. If we are using kinship coefficient estimates
from PC-Relate to construct this GRM, then the function
pcrelateToMatrix
should be used to provide the matrix in
the appropriate format for fitNullModel
.
# mypcrel contains Kinship Estimates from a previous PC-Relate analysis
myGRM <- pcrelateToMatrix(mypcrel)
myGRM[1:5,1:5]
## 5 x 5 Matrix of class "dsyMatrix"
## NA19625 NA19649 NA19650 NA19651 NA19652
## NA19625 0.9802757412 -0.001187704 -0.001460617 -0.008141279 0.0003310268
## NA19649 -0.0011877043 1.040838166 0.569730537 0.004897574 -0.0128561993
## NA19650 -0.0014606167 0.569730537 1.053822976 0.006999645 0.0146377241
## NA19651 -0.0081412786 0.004897574 0.006999645 1.011024569 0.0093895282
## NA19652 0.0003310268 -0.012856199 0.014637724 0.009389528 0.9818343788
Note that both the row and column names of this matrix are the same scanIDs as used in the scan annotation data.
There are two steps to performing genetic association testing with
GENESIS.
First, the null model (i.e. the model with no SNP genotype term) is fit
using the fitNullModel
function. Second, the output of the
null model fit is used in conjunction with the genotype data to quickly
run SNP-phenotype association tests using the
assocTestSingle
function. There is a computational
advantage to splitting these two steps into two function calls; the null
model only needs to be fit once, and SNP association tests can be
paralelized by chromosome or some other partitioning to speed up
analyses (details below).
The first step for association testing with GENESIS is to fit the mixed model under the null hypothesis that each SNP has no effect. This null model contains all of the covariates, including ancestry representative PCs, as well as any random effects, such as a polygenic effect due to genetic relatedness, but it does not include any SNP genotype terms as fixed effects.
Using the fitNullModel
function, random effects in the
null model are specified via their covariance structures. This allows
for the inclusion of a polygenic random effect using a kinship matrix or
genetic relationship matrix (GRM).
A linear mixed model (LMM) should be fit when analyzing a quantitative phenotype. The example R code below fits a basic null mixed model.
# fit the null mixed model
nullmod <- fitNullModel(scanAnnot, outcome = "pheno", covars = "pc1",
cov.mat = myGRM, family = "gaussian")
## [1] 0.4545551 0.4545551 -240.5810010 1.0922746
## [1] 0.07965827 0.76971759 -237.50257078 1.07504652
## [1] 0.1002053 0.8089861 -237.4969271 1.0050101
## [1] 0.1010074 0.8127311 -237.4969254 1.0000249
ScanAnnotationDataFrame
or data.frame
object containing the sample dataoutcome
specifies the name of the outcome variable in
scanAnnot
covars
specifies the names of the covariates in
scanAnnot
cov.mat
specifies the covariance structures for the
random effects included in the modelfamily
should be gaussian for a quantitative phenotype,
specifying a linear mixed modelThe Average Information REML (AIREML) procedure is used to estimate
the variance components of the random effects. When
verbose = TRUE
, the variance component estimates, the
log-likelihood, and the residual sum of squares in each iteration are
printed to the R console (shown above). In this example,
Sigma^2_A
is the variance component for the random effect
specified in cov.mat
, and Sigma^2_E
is the
residual variance component.
The model can be fit with multiple fixed effect covariates by setting
covars
equal to vector of covariate names. For example, if
we wanted to include the variables “pc1”, “pc2”, “sex”, and “age” all as
covariates in the model:
The model also can be fit with multiple random effects. This is done
by setting cov.mat
equal to a list of matrices. For
example, if we wanted to include a polygenic random effect with
covariance structure given by the matrix “myGRM” and a household random
effect with covariance structure specified by the matrix “H”:
nullmod <- fitNullModel(scanAnnot, outcome = "pheno", covars = "pc1",
cov.mat = list("GRM" = myGRM, "House" = H),
family = "gaussian")
The names of the matrices in cov.mat
determine the names
of the variance component parameters. Therefore, in this example, the
output printed to the R console will include Sigma^2_GRM
for the random effect specified by “myGRM”, Sigma^2_House
for the random effect specified by “H”, and Sigma^2_E
for
the residual variance component.
Note: the row and column names of each matrix used to specify the covariance structure of a random effect in the mixed model must be the unique scanIDs for each sample in the analysis.
LMMs are typically fit under an assumption of constant (homogeneous)
residual variance for all observations. However, for some outcomes,
there may be evidence that different groups of observations have
different residual variances, in which case the assumption of
homoscedasticity is violated. group.var
can be used in
order to fit separate (heterogeneous) residual variance components by
some grouping variable. For example, if we have a categorical variable
“study” in our scanAnnot
, then we can estimate a different
residual variance component for each unique value of “study” by using
the following code:
nullmod <- fitNullModel(scanAnnot, outcome = "pheno", covars = "pc1",
cov.mat = myGRM, family = "gaussian",
group.var = "study")
In this example, the residual variance component
Sigma^2_E
is replaced with group specific residual variance
components Sigma^2_study1
, Sigma^2_study2
, …,
where “study1”, “study2”, … are the unique values of the “study”
variable.
Ideally, a generalized linear mixed model (GLMM) would be fit for a
binary phenotype; however, fitting a GLMM is much more computationally
demanding than fitting an LMM. To provide a compuationally efficient
approach to fitting such a model, fitNullModel
uses the
penalized quasi-likelihood (PQL) approximation to the GLMM (Breslow and
Clayton). The implementation of this procedure in GENESIS
is the same as in GMMAT (Chen et al.), and more details can be found in
that manuscript. If our outcome variable, “pheno”, were binary, then the
same R code could be used to fit the null model, but with
family = binomial
.
nullmod <- fitNullModel(scanAnnot, outcome = "pheno", covars = "pc1",
cov.mat = myGRM, family = "binomial")
Multiple fixed effect covariates and multiple random effects can be
specified for binary phenotypes in the same way as they are for
quantitative phenotypes. group.var
does not apply here.
The second step for association testing with GENESIS
is to use the fitted null model to test the SNPs in the
GenotypeData
object for association with the specified
outcome variable. This is done with the assocTestSingle
function. The use of assocTestSingle
for running
association tests with a quantitative or binary phenotype is
identical.
Before we can run an association test on a GenotypeData
object, we much first decide how many SNPs we want to read at a time. We
do this by creating a GenotypeBlockIterator
object that
defines blocks of SNPs. The default setting is to read 10,000 SNPs in
each block, but this may be changed with the snpBlock
argument.
The example R code below runs the association analyses using the null
model we fit using fitNullModel
in the previous
section.
genoData
is a GenotypeData
class
objectnull.model
is the output from
fitNullModel
By default, the function will perform association tests at all SNPs
in the genoData
object. However, for computational reasons
it may be practical to parallelize this step, partitioning SNPs by
chromosome or some other pre-selected grouping. If we only want to test
a pre-specified set of SNPs, this can be done by passing a vector of
snpID values to the snpInclude
argument when we create the
iterator.
The fitNullModel
function will return a list with a
large amount of data. Some of the more useful output for the user
includes:
varComp
: the variance component estimates for the
random effectsfixef
: a data.frame
with point estimates,
standard errors, test statistics, and p-values for each of the fixed
effect covariatesfit
: a data.frame
with the outcome, the
fitted values, and various residuals from the modelThere are also metrics assessing model fit such as the log-likelihood
(logLik
), restricted log-likelihood (logLikR
),
and the Akaike information criterion (AIC
). Additionally,
there are some objects such as the working outcome vector
(workingY
) and the Cholesky decomposition of the inverse of
the estimated phenotype covariance matrix (cholSigmaInv
)
that are used by the assocTestSingle
function for
association testing. Further details describing all of the output can be
found with the command help(fitNullModel)
.
The assocTestSingle
function will return a
data.frame
with summary information from the association
test for each SNP. Each row corresponds to a different SNP.
## variant.id chr pos n.obs freq MAC Score Score.SE Score.Stat
## 1 1 1 1 173 0.3901734 135 1.201347 8.670690 0.1385526
## 2 2 1 2 173 0.4942197 171 -6.915853 9.150746 -0.7557693
## 3 3 1 3 173 0.1011561 35 -1.366449 5.441663 -0.2511087
## 4 4 1 4 173 0.4855491 168 -7.120154 9.429106 -0.7551251
## 5 5 1 5 172 0.4447674 153 7.404077 8.708558 0.8502070
## 6 6 1 6 172 0.2093023 72 10.364356 7.374279 1.4054739
## Score.pval Est Est.SE PVE
## 1 0.8898037 0.01597942 0.1153311 0.0001122594
## 2 0.4497875 -0.08259101 0.1092807 0.0033401932
## 3 0.8017301 -0.04614558 0.1837674 0.0003687368
## 4 0.4501739 -0.08008448 0.1060546 0.0033345010
## 5 0.3952100 0.09762892 0.1148296 0.0042270995
## 6 0.1598804 0.19059138 0.1356065 0.0115515076
variant.id
: the unique snp IDchr
: the chromosomepos
: the positionn.obs
: the number of samples analyzed at that SNPfreq
: the frequency of the tested (“A”) alleleMAC
: the minor allele countScore
: the value of the score functionScore.SE
: the estimated standard error of the
scoreScore.Stat
: the score Z test statisticScore.pval
: the p-value based on the score test
statisticEst
: an approximation of the effect size estimate
(beta) for that SNPEst.SE
: an approximation of the standard error of the
effect size estimatePVE
: an approximation of the proportion of phenotype
variance explainedFurther details describing all of the output can be found with the
command help(assocTestSingle)
.
It is often of interest to estimate the proportion of the total
phenotype variability explained by the entire set of genotyped SNPs
avaialable; this provides an estimate of the narrow sense heritability
of the trait. One method for estimating heritability is to use the
variance component estimates from the null mixed model. GENESIS
includes the varCompCI
function for computing the
proportion of variance explained by each random effect along with 95%
confidence intervals.
## Proportion Lower 95 Upper 95
## V_A 0.110543 -0.2190884 0.4401745
## V_resid.var 0.889457 0.5598255 1.2190884
fitNullModel
prop
is a logical indicator of whether the point
estimates and confidence intervals should be returned as the proportion
of total variability explained (TRUE) or on the orginal scale
(FALSE)When additional random effects are included in the model (e.g. a
shared household effect), varCompCI
will also return the
proportion of variability explained by each of these components.
Note: varCompCI
can not compute proportions of variance
explained when heterogeneous residual variances are used in the null
model (i.e. group.var
is used in
fitNullModel
). Confidence intervals can still be computed
for the variance component estimates on the original scale by setting
prop = FALSE
.
Note: variance component estimates are not interpretable for binary
phenotypes when fit using the PQL method implemented in
fitNullModel
; proportions of variance explained should not
be calculated for these models.
Breslow NE and Clayton DG. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88: 9-25.
Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedon JC, Redline S, Papanicolaou GJ, Thornton TA, Laurie CC, Rice K and Lin X. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies Using Logistic Mixed Models. American Journal of Human Genetics, 98(4):653-66.
Gogarten, S.M., Bhangale, T., Conomos, M.P., Laurie, C.A., McHugh, C.P., Painter, I., … & Laurie, C.C. (2012). GWASTools: an R/Bioconductor package for quality control and analysis of Genome-Wide Association Studies. Bioinformatics, 28(24), 3329-3331.