Package 'GENESIS'

Title:	GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness
Description:	The GENESIS package provides methodology for estimating, inferring, and accounting for population and pedigree structure in genetic analyses. The current implementation provides functions to perform PC-AiR (Conomos et al., 2015, Gen Epi) and PC-Relate (Conomos et al., 2016, AJHG). PC-AiR performs a Principal Components Analysis on genome-wide SNP data for the detection of population structure in a sample that may contain known or cryptic relatedness. Unlike standard PCA, PC-AiR accounts for relatedness in the sample to provide accurate ancestry inference that is not confounded by family structure. PC-Relate uses ancestry representative principal components to adjust for population structure/ancestry and accurately estimate measures of recent genetic relatedness such as kinship coefficients, IBD sharing probabilities, and inbreeding coefficients. Additionally, functions are provided to perform efficient variance component estimation and mixed model association testing for both quantitative and binary phenotypes.
Authors:	Matthew P. Conomos, Stephanie M. Gogarten, Lisa Brown, Han Chen, Thomas Lumley, Kenneth Rice, Tamar Sofer, Adrienne Stilp, Timothy Thornton, Chaoyu Yu
Maintainer:	Stephanie M. Gogarten <[email protected]>
License:	GPL-3
Version:	2.37.0
Built:	2024-12-29 06:56:21 UTC
Source:	https://github.com/bioc/GENESIS

Help Index

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness
admixMap
Aggregate Association Testing
Genotype Association Testing with Mixed Models
Computes variant-specific inflation factors
Return the effect allele for association testing
Fit a Model Under the Null Hypothesis
Defunct functions in package GENESIS
Matrix of Pairwise Kinship Coefficient Estimates for the combined HapMap ASW and MXL Sample found with the KING-robust estimator from the KING software.
Perform a joint score test
Store kinship matrix in GDS
Convert KING text output to an R Matrix
Make a sparse matrix from a dense matrix or a table of pairwise values
PC-AiR: Principal Components Analysis in Related Samples
Partition a sample into an ancestry representative 'unrelated subset' and a 'related subset'
PC-Relate: Model-Free Estimation of Recent Genetic Relatedness
Creates a Genetic Relationship Matrix (GRM) of Pairwise Kinship Coefficient Estimates from PC-Relate Output
PC-AiR: Plotting PCs
PC-AiR: Principal Components Analysis in Related Samples
Annotation for 1000 genomes Phase 3 samples
Variance Component Confidence Intervals

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness

Description

The GENESIS package provides methodology for estimating, inferring, and accounting for population and pedigree structure in genetic analyses. The current implementation performs PC-AiR (Conomos et al., 2015, Gen Epi) and PC-Relate (Conomos et al., 2016, AJHG). PC-AiR performs a Principal Components Analysis on genome-wide SNP data for the detection of population structure in a sample that may contain known or cryptic relatedness. Unlike standard PCA, PC-AiR accounts for relatedness in the sample to provide accurate ancestry inference that is not confounded by family structure. PC-Relate uses ancestry representative principal components to adjust for population structure/ancestry and accurately estimate measures of recent genetic relatedness such as kinship coefficients, IBD sharing probabilities, and inbreeding coefficients. Additionally, functions are provided to perform efficient variance component estimation and mixed model association testing for both quantitative and binary phenotypes.

Details

The PC-AiR analysis is performed using the pcair function, which takes genotype data and pairwise measures of kinship and ancestry divergence as input and returns PC-AiR PCs as the ouput. The function pcairPartition is called within pcair and uses the PC-AiR algorithm to partition the sample into an ancestry representative ‘unrelated subset’ and ‘related subset’. The function plot.pcair can be used to plot pairs of PCs from a class 'pcair' object returned by the function pcair. The function kingToMatrix can be used to convert output text files from the KING software (Manichaikul et al., 2010) into an R matrix of pairwise kinship coefficient estimates in a format that can be used by the functions pcair and pcairPartition. The PC-Relate analysis is performed using the pcrelate function, which takes genotype data and PCs from PC-AiR and returns estimates of kinship coefficients, IBD sharing probabilities, and inbreeding coefficients. There are two functions required to perform SNP genotype association testing with mixed models. First, fitNullModel is called to fit the null model (i.e. no SNP genotype term) including fixed effects covariates, such as PC-AiR PCs, and random effects specified by their covariance structures, such as a kinship matrix created from PC-Relate output using pcrelateToMatrix. The function fitNullModel uses AIREML to estimate variance components for the random effects, and the function varCompCI can be used to find confidence intervals on the estimates as well as the proportion of total variability they explain; this allows for heritability estimation. Second, assocTestSingle is called with the null model output and the genotype data to perform either Wald or score based association tests.

Author(s)

Matthew P. Conomos, Stephanie M. Gogarten, Lisa Brown, Han Chen, Ken Rice, Tamar Sofer, Timothy Thornton, Chaoyu Yu

Maintainer: Stephanie M. Gogarten <[email protected]>

References

Conomos M.P., Miller M., & Thornton T. (2015). Robust Inference of Population Structure for Ancestry Prediction and Correction of Stratification in the Presence of Relatedness. Genetic Epidemiology, 39(4), 276-293.

Conomos M.P., Reiner A.P., Weir B.S., & Thornton T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148.

Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

admixMap

Description

Run admixture analyses

Usage

admixMap(admixDataList, null.model, male.diploid=TRUE,
         genome.build=c("hg19", "hg38"),
         BPPARAM=bpparam(), verbose=TRUE)
admixMap(admixDataList, null.model, male.diploid=TRUE,
         genome.build=c("hg19", "hg38"),
         BPPARAM=bpparam(), verbose=TRUE)

Arguments

`admixDataList`	named list of `GenotypeIterator` or `SeqVarIterator` objects for each ancestry
`null.model`	A null model object returned by `fitNullModel`.
`male.diploid`	Logical for whether males on sex chromosomes are coded as diploid. Default is 'male.diploid=TRUE', meaning sex chromosome genotypes for males have values 0/2. If the input object codes males as 0/1 on sex chromosomes, set 'male.diploid=FALSE'.
`genome.build`	A character sting indicating genome build; used to identify pseudoautosomal regions on the X and Y chromosomes. These regions are not treated as sex chromosomes when calculating allele frequencies.
`BPPARAM`	A `BiocParallelParam` object to process blocks of variants in parallel. If not provided, the default back-end returned by `bpparam` will be used.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.

Details

This function is used with local ancestry results such as those obtained from RFMix. RFMix output may be converted to PLINK format, and then to GDS with snpgdsBED2GDS.

admixDataList should have one value for each ancestry to be included in the test. The sum of all local ancestries at a particular locus must add up to 2, so if there are K ancestry groups, then only K-1 genotypes can be included since one local ancestry count can be written as a linear combination of all of the other local ancestry counts, resulting in collinearity and a matrix that won't be invertible.

See the example for how one might set up the admixDataList object. List names will propagate to the output file.

admixMap uses the BiocParallel package to process iterator chunks in parallel. See the BiocParallel documentation for more information on the default behaviour of bpparam and how to register different parallel backends. If serial execution is desired, set BPPARAM=BiocParallel::SerialParam(). Note that parallel execution requires more RAM than serial execution.

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

A data.frame where each row refers to a different variant with the columns:

`variant.id`	The variant ID
`chr`	The chromosome value
`pos`	The base pair position
`n.obs`	The number of samples with non-missing genotypes
`*.freq`	The estimated frequency of alleles derived from each ancestry at that variant
`*.Est`	The effect size estimate for each additional copy of an allele derived from each ancestry, relative to the reference ancestry
`*.SE`	The estimated standard error of the effect size estimate for each ancestry
`Joint.Stat`	The chi-square Wald test statistic for the joint test of all local ancestry terms
`Joint.pval`	The Wald p-value for the joint test of all local ancestry terms

Author(s)

Matthew P. Conomos, Lisa Brown, Stephanie M. Gogarten, Tamar Sofer, Ken Rice, Chaoyu Yu

References

Brown, L.A. et al. (2017). Admixture Mapping Identifies an Amerindian Ancestry Locus Associated with Albuminuria in Hispanics in the United States. J Am Soc Nephrol. 28(7):2211-2220.

Maples, B.K. et al. (2013). RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 93(2):278-88.

Examples

library(GWASTools)

# option 1: one GDS file per ancestry
afrfile <- system.file("extdata", "HapMap_ASW_MXL_local_afr.gds", package="GENESIS")
amerfile <- system.file("extdata", "HapMap_ASW_MXL_local_amer.gds", package="GENESIS")
eurfile <- system.file("extdata", "HapMap_ASW_MXL_local_eur.gds", package="GENESIS")
files <- list(afr=afrfile, amer=amerfile, eur=eurfile)
gdsList <- lapply(files, GdsGenotypeReader)

# make ScanAnnotationDataFrame
scanAnnot <- ScanAnnotationDataFrame(data.frame(
    scanID=getScanID(gdsList[[1]]), stringsAsFactors=FALSE))

# generate a phenotype
set.seed(4)
nsamp <- nrow(scanAnnot)
scanAnnot$pheno <- rnorm(nsamp, mean=0, sd=1)
set.seed(5)
scanAnnot$covar <- sample(0:1, nsamp, replace=TRUE)

genoDataList <- lapply(gdsList, GenotypeData, scanAnnot=scanAnnot)

# iterators
# if we have 3 ancestries total, only 2 should be included in test
genoIterators <- lapply(genoDataList[1:2], GenotypeBlockIterator)

# fit the null mixed model
null.model <- fitNullModel(scanAnnot, outcome="pheno", covars="covar")

# run the association test
myassoc <- admixMap(genoIterators, null.model,
                    BPPARAM=BiocParallel::SerialParam())
head(myassoc)

lapply(genoDataList, close)


# option 2: create a single file with multiple ancestries
# first, get dosages from all ancestries
library(gdsfmt)
dosages <- lapply(files, function(f) {
    gds <- openfn.gds(f)
    geno <- read.gdsn(index.gdsn(gds, "genotype"))
    closefn.gds(gds)
    geno
})
lapply(dosages, dim)

# create a new file with three dosage matrices, keeping all
# sample and snp nodes from one original file
tmpfile <- tempfile()
file.copy(afrfile, tmpfile)
gds <- openfn.gds(tmpfile, readonly=FALSE)
delete.gdsn(index.gdsn(gds, "genotype"))
add.gdsn(gds, "dosage_afr", dosages[["afr"]])
add.gdsn(gds, "dosage_amer", dosages[["amer"]])
add.gdsn(gds, "dosage_eur", dosages[["eur"]])
closefn.gds(gds)
cleanup.gds(tmpfile)

# read in GDS data, specifying the node for each ancestry
gds <- openfn.gds(tmpfile)
gds
genoDataList <- list()
for (anc in c("afr", "amer", "eur")){
  gdsr <- GdsGenotypeReader(gds, genotypeVar=paste0("dosage_", anc))
  genoDataList[[anc]] <- GenotypeData(gdsr, scanAnnot=scanAnnot)
}

# iterators
genoIterators <- lapply(genoDataList[1:2], GenotypeBlockIterator)

# run the association test
myassoc <- admixMap(genoIterators, null.model,
                    BPPARAM=BiocParallel::SerialParam())

close(genoDataList[[1]])
unlink(tmpfile)
library(GWASTools)

# option 1: one GDS file per ancestry
afrfile <- system.file("extdata", "HapMap_ASW_MXL_local_afr.gds", package="GENESIS")
amerfile <- system.file("extdata", "HapMap_ASW_MXL_local_amer.gds", package="GENESIS")
eurfile <- system.file("extdata", "HapMap_ASW_MXL_local_eur.gds", package="GENESIS")
files <- list(afr=afrfile, amer=amerfile, eur=eurfile)
gdsList <- lapply(files, GdsGenotypeReader)

# make ScanAnnotationDataFrame
scanAnnot <- ScanAnnotationDataFrame(data.frame(
    scanID=getScanID(gdsList[[1]]), stringsAsFactors=FALSE))

# generate a phenotype
set.seed(4)
nsamp <- nrow(scanAnnot)
scanAnnot$pheno <- rnorm(nsamp, mean=0, sd=1)
set.seed(5)
scanAnnot$covar <- sample(0:1, nsamp, replace=TRUE)

genoDataList <- lapply(gdsList, GenotypeData, scanAnnot=scanAnnot)

# iterators
# if we have 3 ancestries total, only 2 should be included in test
genoIterators <- lapply(genoDataList[1:2], GenotypeBlockIterator)

# fit the null mixed model
null.model <- fitNullModel(scanAnnot, outcome="pheno", covars="covar")

# run the association test
myassoc <- admixMap(genoIterators, null.model,
                    BPPARAM=BiocParallel::SerialParam())
head(myassoc)

lapply(genoDataList, close)


# option 2: create a single file with multiple ancestries
# first, get dosages from all ancestries
library(gdsfmt)
dosages <- lapply(files, function(f) {
    gds <- openfn.gds(f)
    geno <- read.gdsn(index.gdsn(gds, "genotype"))
    closefn.gds(gds)
    geno
})
lapply(dosages, dim)

# create a new file with three dosage matrices, keeping all
# sample and snp nodes from one original file
tmpfile <- tempfile()
file.copy(afrfile, tmpfile)
gds <- openfn.gds(tmpfile, readonly=FALSE)
delete.gdsn(index.gdsn(gds, "genotype"))
add.gdsn(gds, "dosage_afr", dosages[["afr"]])
add.gdsn(gds, "dosage_amer", dosages[["amer"]])
add.gdsn(gds, "dosage_eur", dosages[["eur"]])
closefn.gds(gds)
cleanup.gds(tmpfile)

# read in GDS data, specifying the node for each ancestry
gds <- openfn.gds(tmpfile)
gds
genoDataList <- list()
for (anc in c("afr", "amer", "eur")){
  gdsr <- GdsGenotypeReader(gds, genotypeVar=paste0("dosage_", anc))
  genoDataList[[anc]] <- GenotypeData(gdsr, scanAnnot=scanAnnot)
}

# iterators
genoIterators <- lapply(genoDataList[1:2], GenotypeBlockIterator)

# run the association test
myassoc <- admixMap(genoIterators, null.model,
                    BPPARAM=BiocParallel::SerialParam())

close(genoDataList[[1]])
unlink(tmpfile)

Aggregate Association Testing

Description

assocTestAggregate performs aggregate association tests using the null model fit with fitNullModel.

Usage

## S4 method for signature 'SeqVarIterator'
assocTestAggregate(gdsobj, null.model, AF.max=1,
                   weight.beta=c(1,1), weight.user=NULL,
                   test=c("Burden", "SKAT", "fastSKAT", "SMMAT", "fastSMMAT",
                   "SKATO", "BinomiRare", "CMP"),
                   neig = 200, ntrace = 500,
                   rho = seq(from = 0, to = 1, by = 0.1),
                   sparse=TRUE, imputed=FALSE,
                   male.diploid=TRUE, genome.build=c("hg19", "hg38"),
                   BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'GenotypeIterator'
assocTestAggregate(gdsobj, null.model, AF.max=1,
                   weight.beta=c(1,1), weight.user=NULL,
                   test=c("Burden", "SKAT", "fastSKAT", "SMMAT", "fastSMMAT",
                   "SKATO", "BinomiRare", "CMP"),
                   neig = 200, ntrace = 500,
                   rho = seq(from = 0, to = 1, by = 0.1),
                   male.diploid=TRUE, BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'SeqVarIterator'
assocTestAggregate(gdsobj, null.model, AF.max=1,
                   weight.beta=c(1,1), weight.user=NULL,
                   test=c("Burden", "SKAT", "fastSKAT", "SMMAT", "fastSMMAT",
                   "SKATO", "BinomiRare", "CMP"),
                   neig = 200, ntrace = 500,
                   rho = seq(from = 0, to = 1, by = 0.1),
                   sparse=TRUE, imputed=FALSE,
                   male.diploid=TRUE, genome.build=c("hg19", "hg38"),
                   BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'GenotypeIterator'
assocTestAggregate(gdsobj, null.model, AF.max=1,
                   weight.beta=c(1,1), weight.user=NULL,
                   test=c("Burden", "SKAT", "fastSKAT", "SMMAT", "fastSMMAT",
                   "SKATO", "BinomiRare", "CMP"),
                   neig = 200, ntrace = 500,
                   rho = seq(from = 0, to = 1, by = 0.1),
                   male.diploid=TRUE, BPPARAM=bpparam(), verbose=TRUE)

Arguments

`gdsobj`	An object of class `SeqVarIterator` from the package SeqVarTools containing the genotype data for the variants and samples to be used for the analysis.
`null.model`	A null model object returned by `fitNullModel`.
`AF.max`	A numeric value specifying the upper bound on the effect allele frequency for variants to be included in the analysis.
`weight.beta`	A numeric vector of length two specifying the two parameters of the Beta distribution used to determine variant weights; weights are given by `dbeta(MAF, a, b)`, where MAF is the minor allele frequency, and a and b are the two parameters specified here. `weight.beta = c(1,25)` gives the Wu weights; `weight.beta = c(0.5, 0.5)` is proportional to the Madsen-Browning weights; and `weight.beta = c(1,1)` gives a weight of 1 to all variants. This input is ignored when `weight.user` is not `NULL`.
`weight.user`	A character string specifying the name of a variable to be used as variant weights. This variable can be in either 1) the variantData slot of `gdsobj` or 2) the `mcols` of the `GRanges` or `GRangesList` object used to create `gdsobj` (when `gdsobj` is a `link{SeqVarRangeIterator}` or `link{SeqVarListIterator}`). When left `NULL` (the default), the weights specified by `weight.beta` will be used.
`test`	A character string specifying the type of test to be performed. The possibilities are `"Burden"` (default), `"SKAT"`, `"fastSKAT"`, `"SMMAT"`, `"fastSMMAT"`, `"BinomiRare"`, or `"CMP"`.
`neig`	The number eigenvalues to approximate by using random projections for calculating p-values with fastSKAT or fastSMMAT; default is 200. See 'Details' for more information.
`ntrace`	The number of vectors to sample when using random projections to estimate the trace needed for p-value calculation with fastSKAT or fastSMMAT; default is 500. See 'Details' for more information.
`rho`	A numeric value (or vector of numeric values) in `[0,1]` specifying the rho parameter when using `test == "SKATO"`; these are the values for which SKAT-O is performed, defining the search space for the optimal `rho`. If `rho = 0`, this is equivalent to a standard SKAT test; if `rho = 1`, this is equivalent to a score burden test.
`sparse`	Logical indicator of whether to read genotypes as sparse Matrix objects; the default is `TRUE`. Set this to `FALSE` if the alternate allele dosage of the genotypes in the test are not expected to be mostly 0.
`imputed`	Logical indicator of whether to read dosages from the "DS" field containing imputed dosages instead of counting the number of alternate alleles.
`male.diploid`	Logical for whether males on sex chromosomes are coded as diploid. Default is 'male.diploid=TRUE', meaning sex chromosome genotypes for males have values 0/2. If the input `gdsobj` codes males as 0/1 on sex chromosomes, set 'male.diploid=FALSE'.
`genome.build`	A character sting indicating genome build; used to identify pseudoautosomal regions on the X and Y chromosomes. These regions are not treated as sex chromosomes when calculating allele frequencies.
`BPPARAM`	A `BiocParallelParam` object to process blocks of variants in parallel. If not provided, the default back-end returned by `bpparam` will be used.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is `TRUE`.

Details

The type of aggregate unit tested depends on the class of iterator used for gdsobj. Options include sliding windows, specific ranges of variants or selection of individual variants (ranges with width 1). See SeqVarIterator for more details.

assocTestAggregate uses the BiocParallel package to process iterator chunks in parallel. See the BiocParallel documentation for more information on the default behaviour of bpparam and how to register different parallel backends. If serial execution is desired, set BPPARAM=BiocParallel::SerialParam(). Note that parallel execution requires more RAM than serial execution.

All samples included in null model will be included in the association test, even if a different set of samples is present in the current filter for gdsobj.

The effect size estimate is for each copy of the alternate allele (when gdsobj is a SeqVarIterator object) or the "A" allele (when gdsobj is a GenotypeIterator object). We refer to this as the "effect allele" in the rest of the documentation. For multiallelic variants in SeqVarIterator objects, each alternate (or "A") allele is tested separately.

Monomorphic variants (including variants where every sample is a heterozygote) are always omitted from the aggregate unit prior to testing.

Somewhat similarly to SKAT-O, the variant Set Mixed Model Association Test (SMMAT, Chen et al., 2019) combines the burden test p-value with an adjusted SKAT (which is asymptotically independent of the burden test) p-value using a chi-square distribution with 4df from Fisher's method.

SKAT and SMMAT will attempt to use Davies' method (i.e. integration) to calculate p-values; if an error occurs in integration or the reported p-values are too small that they are unreliable (i.e. near machine epsilon), then the saddlepoint approximation will instead be used to calculate the p-values.

The fastSKAT method of Lumley et al. (2018) uses random matrix theory to speed up the computation of SKAT p-values. When test = "fastSKAT", the function attempts to inteligently determine which p-value calculation approach to use for each aggregation unit: (1) if min(number samples, number variants) is small enough, then the standard SKAT p-value calculation is used; (2) if min(number samples, number variants) is too large for standard SKAT, but small enough to explicitly compute the genotype covariance matrix, random projections are used to approximate the eigenvalues of the covariance matrix, and the fastSKAT p-value calculation is used; (3) if min(number samples, number variants) is too big to explicitly compute the genotype covariance matrix, random projections are used to approximate both the eigenvalues and the trace of the covariance matrix, and the fastSKAT p-value calculation is used.)

The fastSMMAT method uses the same random matrix theory as fastSKAT to speed up the computation of the p-value for the adjusted SKAT component of the test. When test = "fastSMMAT", the function uses the same logic as for fastSKAT to determine which p-value calculation approach to use for each aggregation unit.

The BinomiRare test, run by using test = "BinomiRare", and the CMP test, run by using test = "CMP" are carrier-only, robust tests. Only variants where the effect allele is minor will be tested. Both tests focuse on carriers of the rare variant allele ("carriers"), and use the estimated probabilities of the binary outcome within the carriers, estimated under the null of not association, and the actual number of observed outcomes, to compute p-values. BinomiRare uses the Poisson-Binomial distribution, and the CMP uses the Conway-Maxwell-Poisson distribution, and is specifically designed for mixed models. (If test = "CMP" but null.model$family$mixedmodel = FALSE, the BinomiRare test will be run instead.) These tests provide both a traditional p-value ("pval") and a mid-p-value ("midp"), which is less conservative/more liberal, with the difference being more pronounced for small number of carriers. The BinomiRare test is described in Sofer (2017) and compared to the Score and SPA in various settings in Sofer and Guo (2020).

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

A list with the following items:

results

A data.frame containing the results from the main analysis. Each row is a separate aggregate test:

If gdsobj is a SeqVarWindowIterator:

`chr`	The chromosome value
`start`	The start position of the window
`end`	The end position of the window

Always:

`n.site`	The number of variant sites included in the test.
`n.alt`	The number of alternate (effect) alleles included in the test.
`n.sample.alt`	The number of samples with an observed alternate (effect) allele at any variant in the aggregate set.

If test is "Burden":

`Score`	The value of the score function
`Score.SE`	The estimated standard error of the Score
`Score.Stat`	The score Z test statistic
`Score.pval`	The score p-value
`Est`	An approximation of the effect size estimate for each additional unit of burden
`Est.SE`	An approximation of the standard error of the effect size estimate
`PVE`	An approximation of the proportion of phenotype variance explained

If test is "SKAT" or "fastSKAT":

`Q`	The SKAT test statistic.
`pval`	The SKAT p-value.
`err`	Takes value 1 if there was an error in calculating the p-value; takes the value 2 when multiple random projections were required to get a good approximation from fastSKAT (the reported p-value is likely still reliable); 0 otherwise.
`pval.method`	The p-value calculation method used. When standard SKAT is used, one of "integration" or "saddlepoint"; when fastSKAT random projections are used to approximate eigenvalues of the genotype covariance matrix, one of "ssvd_integration" or "ssvd_saddlepoint"; when fastSKAT random projections are used to approximate both the eigenvalues and the trace of the genotype covariance matrix, one of "rsvd_integration" or "rsvd_saddlepoint".

If test is "SMMAT" or "fastSMMAT":

`Score_burden`	The value of the score function for the burden test
`Score.SE_burden`	The estimated standard error of the Score for the burden test
`Stat_burden`	The score Z test statistic for the burden test
`pval_burden`	The burden test p-value.
`Q_theta`	The test statistic for the adjusted SKAT test (which is asymptotically independent of the burden test)
`pval_theta`	The p-value of the adjusted SKAT test (which is asymptotically independent of the burden test)
`pval_SMMAT`	The SMMAT p-value after combining pval_burden and pval_theta using Fisher's method.
`err`	Takes value 1 if there was an error calculating the SMMAT p-value; 0 otherwise. If `err=1`, `pval_SMMAT` is set to `pval_burden`.
`pval_theta.method`	The p-value calculation method used for `pval_theta` (the adjusted SKAT test). When standard SMMAT is used, one of "integration" or "saddlepoint"; when fastSMMAT random projections are used to approximate eigenvalues of the genotype covariance matrix, one of "ssvd_integration" or "ssvd_saddlepoint"; when fastSMMAT random projections are used to approximate both the eigenvalues and the trace of the genotype covariance matrix, one of "rsvd_integration" or "rsvd_saddlepoint".

If test is "SKATO":

`Q_rho`	The SKAT test statistic for the value of rho specified. There will be as many of these variables as there are rho values chosen.
`pval_rho`	The SKAT p-value for the value of rho specified. There will be as many of these variables as there are rho values chosen.
`err_rho`	Takes value 1 if there was an error in calculating the p-value for the value of rho specified when using the "kuonen" or "davies" methods; 0 otherwise. When there is an error, the p-value returned is from the "liu" method. There will be as many of these variables as there are rho values chosen.
`min.pval`	The minimum p-value among the p-values calculated for each choice of rho.
`opt.rho`	The optimal rho value; i.e. the rho value that gave the minimum p-value.
`pval_SKATO`	The SKAT-O p-value after adjustment for searching across multiple rho values.

If test is "BinomiRare" or "CMP":

`n.carrier`	Number of individuals with at least one copy of the effect allele
`n.D.carrier`	Number of cases with at least one copy of the effect allele
`pval`	p-value
`mid.pval`	mid-p-value
`variantInfo`	A list with as many elements as aggregate tests performed. Each element of the list is a data.frame providing information on the variants used in the aggregate test with results presented in the corresponding row of `results`. Each of these data.frames has the following information:
`variant.id`	The variant ID
`chr`	The chromosome value
`pos`	The base pair position
`allele.index`	The index of the alternate allele. For biallelic variants, this will always be 1.
`n.obs`	The number of samples with non-missing genotypes
`freq`	The estimated effect allele frequency
`MAC`	The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the allele specified by `allele.index` with the sum of all other alleles.
`weight`	The weight assigned to the variant in the analysis.

Author(s)

Matthew P. Conomos, Stephanie M. Gogarten, Thomas Lumley, Tamar Sofer, Ken Rice, Chaoyu Yu, Han Chen

References

Leal, S.M. & Li, B. (2008). Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data. American Journal of Human Genetics, 83(3): 311-321.

Browning, S.R. & Madsen, B.E. (2009). A Groupwise Association Test for Rare Mutations Using a Weighted Sum Statistic. PLoS Genetics, 5(2): e1000384.

Wu, M.C, Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test. American Journal of Human Genetics, 89(1): 82-93.

Lee, S. et al. (2012). Optimal Unified Approach for Rare-Variant Association Testing with Application to Small-Sample Case-Control Whole-Exome Sequencing Studies. American Journal of Human Genetics, 91(2): 224-237.

Chen, H., Huffman, J. E., Brody, J. A., Wang, C., Lee, S., Li, Z., ... & Blangero, J. (2019). Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. The American Journal of Human Genetics, 104(2), 260-274.

Lumley, T., Brody, J., Peloso, G., Morrison, A., & Rice, K. (2018). FastSKAT: Sequence kernel association tests for very large sets of markers. Genetic epidemiology, 42(6), 516-527.

Examples

library(SeqVarTools)
library(Biobase)
library(GenomicRanges)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarData object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))

# fit the null model
nullmod <- fitNullModel(seqData, outcome="outcome", covars="sex")

# burden test - Range Iterator
gr <- GRanges(seqnames=rep(1,3), ranges=IRanges(start=c(1e6, 2e6, 3e6), width=1e6))
iterator <- SeqVarRangeIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
lapply(assoc$variantInfo, head)

# SKAT test - Window Iterator
seqSetFilterChrom(seqData, include="22")
iterator <- SeqVarWindowIterator(seqData)
assoc <- assocTestAggregate(iterator, nullmod, test="SKAT",
                            BPPARAM=BiocParallel::SerialParam())
head(assoc$results)
head(assoc$variantInfo)

# SKAT-O test - List Iterator
seqResetFilter(iterator)
gr <- GRangesList(
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(16e6, 17e6), width=1e6)),
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(18e6, 20e6), width=1e6)))
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="SKAT", rho=seq(0, 1, 0.25),
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

# user-specified weights - option 1
seqResetFilter(iterator)
variant.id <- seqGetData(gds, "variant.id")
weights <- data.frame(variant.id, weight=runif(length(variant.id)))
variantData(seqData) <- AnnotatedDataFrame(weights)
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden", weight.user="weight",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

# user-specified weights - option 2
seqResetFilter(iterator)
variantData(seqData)$weight <- NULL
gr <- GRangesList(
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(16e6, 17e6), width=1e6), weight=runif(2)),
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(18e6, 20e6), width=1e6), weight=runif(2)))
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden", weight.user="weight",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

seqClose(seqData)
library(SeqVarTools)
library(Biobase)
library(GenomicRanges)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarData object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))

# fit the null model
nullmod <- fitNullModel(seqData, outcome="outcome", covars="sex")

# burden test - Range Iterator
gr <- GRanges(seqnames=rep(1,3), ranges=IRanges(start=c(1e6, 2e6, 3e6), width=1e6))
iterator <- SeqVarRangeIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
lapply(assoc$variantInfo, head)

# SKAT test - Window Iterator
seqSetFilterChrom(seqData, include="22")
iterator <- SeqVarWindowIterator(seqData)
assoc <- assocTestAggregate(iterator, nullmod, test="SKAT",
                            BPPARAM=BiocParallel::SerialParam())
head(assoc$results)
head(assoc$variantInfo)

# SKAT-O test - List Iterator
seqResetFilter(iterator)
gr <- GRangesList(
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(16e6, 17e6), width=1e6)),
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(18e6, 20e6), width=1e6)))
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="SKAT", rho=seq(0, 1, 0.25),
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

# user-specified weights - option 1
seqResetFilter(iterator)
variant.id <- seqGetData(gds, "variant.id")
weights <- data.frame(variant.id, weight=runif(length(variant.id)))
variantData(seqData) <- AnnotatedDataFrame(weights)
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden", weight.user="weight",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

# user-specified weights - option 2
seqResetFilter(iterator)
variantData(seqData)$weight <- NULL
gr <- GRangesList(
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(16e6, 17e6), width=1e6), weight=runif(2)),
  GRanges(seqnames=rep(22,2), ranges=IRanges(start=c(18e6, 20e6), width=1e6), weight=runif(2)))
iterator <- SeqVarListIterator(seqData, variantRanges=gr)
assoc <- assocTestAggregate(iterator, nullmod, test="Burden", weight.user="weight",
                            BPPARAM=BiocParallel::SerialParam())
assoc$results
assoc$variantInfo

seqClose(seqData)

Genotype Association Testing with Mixed Models

Description

assocTestSingle performs genotype association tests using the null model fit with fitNullModel.

Usage

## S4 method for signature 'SeqVarIterator'
assocTestSingle(gdsobj, null.model,
                test=c("Score", "Score.SPA", "BinomiRare", "CMP"),
                recalc.pval.thresh=0.05, fast.score.SE=FALSE,
                GxE=NULL,
                geno.coding=c("additive", "dominant", "recessive"),
                sparse=TRUE, imputed=FALSE,
                male.diploid=TRUE, genome.build=c("hg19", "hg38"),
                BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'GenotypeIterator'
assocTestSingle(gdsobj, null.model,
                test=c("Score", "Score.SPA", "BinomiRare", "CMP"),
                recalc.pval.thresh=0.05, GxE=NULL,
                geno.coding=c("additive", "dominant", "recessive"),
                male.diploid=TRUE, BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'SeqVarIterator'
assocTestSingle(gdsobj, null.model,
                test=c("Score", "Score.SPA", "BinomiRare", "CMP"),
                recalc.pval.thresh=0.05, fast.score.SE=FALSE,
                GxE=NULL,
                geno.coding=c("additive", "dominant", "recessive"),
                sparse=TRUE, imputed=FALSE,
                male.diploid=TRUE, genome.build=c("hg19", "hg38"),
                BPPARAM=bpparam(), verbose=TRUE)
## S4 method for signature 'GenotypeIterator'
assocTestSingle(gdsobj, null.model,
                test=c("Score", "Score.SPA", "BinomiRare", "CMP"),
                recalc.pval.thresh=0.05, GxE=NULL,
                geno.coding=c("additive", "dominant", "recessive"),
                male.diploid=TRUE, BPPARAM=bpparam(), verbose=TRUE)

Arguments

`gdsobj`	An object of class `SeqVarIterator` from the package SeqVarTools, or an object of class `GenotypeIterator` from the package GWASTools, containing the genotype data for the variants and samples to be used for the analysis.
`null.model`	A null model object returned by `fitNullModel`.
`test`	A character string specifying the type of test to be performed. The possibilities are `"Score"` (default), `"Score.SPA"`, `"BinomiRare"`, or `"CMP"`; `"Score.SPA"`, `"BinomiRare"`, and `"CMP"` can only be used when the family of the null model fit with `fitNullModel` is binomial.
`recalc.pval.thresh`	If test is not "Score", recalculate p-values using the specified 'test' for variants with a Score p-value below this threshold; return the score p-value for all other variants.
`fast.score.SE`	Logical indicator of whether to use the fast approximation of the score standard error for testing variant association. When `FALSE` (default), the true score SE is calculated. When `TRUE`, the fast score SE approximation from SAIGE is used. This option can only be used with a null model fit with `fitNullModelFastScore` or updated with `nullModelFastScore`. See 'Details' for further information.
`GxE`	A vector of character strings specifying the names of the variables for which a genotype interaction term should be included.If `GxE` is not `NULL`, `test` is ignored and Wald tests of interaction are performed. If `GxE` is `NULL` (default) no genotype interactions are included. See 'Details' for further information.
`geno.coding`	Whether genotypes should be coded as "additive" (0, 1, or 2 copies of the effect allele), "recessive" (1=homozygous for the effect allele, 0 otherwise), or "dominant" (1=heterozygous or homozygous for the effect allele, 0 for no effect allele). For recessive coding on sex chromosomes, males are coded as 1 if they are hemizygous for the effect allele.
`sparse`	Logical indicator of whether to read genotypes as sparse Matrix objects; the default is `TRUE`. Set this to `FALSE` if the alternate allele dosage of the genotypes in the test are not expected to be mostly 0.
`imputed`	Logical indicator of whether to read dosages from the "DS" field containing imputed dosages instead of counting the number of alternate alleles.
`male.diploid`	Logical for whether males on sex chromosomes are coded as diploid. Default is 'male.diploid=TRUE', meaning sex chromosome genotypes for males have values 0/2. If the input `gdsobj` codes males as 0/1 on sex chromosomes, set 'male.diploid=FALSE'.
`genome.build`	A character sting indicating genome build; used to identify pseudoautosomal regions on the X and Y chromosomes. These regions are not treated as sex chromosomes when calculating allele frequencies.
`BPPARAM`	A `BiocParallelParam` object to process blocks of variants in parallel. If not provided, the default back-end returned by `bpparam` will be used.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is `TRUE`.

Details

assocTestSingle uses the BiocParallel package to process iterator chunks in parallel. See the BiocParallel documentation for more information on the default behaviour of bpparam and how to register different parallel backends. If serial execution is desired, set BPPARAM=BiocParallel::SerialParam(). Note that parallel execution requires more RAM than serial execution.

All samples included in null model will be included in the association test, even if a different set of samples is present in the current filter for gdsobj.

Sporadic missing genotype values are mean imputed using the allele frequency calculated on all other samples at that variant.

Monomorphic variants (including variants where every sample is a heterozygote) are omitted from the results.

The input GxE can be used to perform GxE tests. Multiple interaction variables may be specified, but all interaction variables specified must have been included as covariates in fitting the null model with fitNullModel. When performing GxE analyses, assocTestSingle will report two tests: (1) the joint Wald test of all genotype interaction terms in the model (this is the test for any genotype interaction effect), and (2) the joint Wald test of the genotype term along with all of the genotype interaction terms (this is the test for any genetic effect). Individual genotype interaction terms can be tested by creating test statistics from the reported effect size estimates and their standard errors (Note: when GxE contains a single continuous or binary covariate, this test is the same as the test for any genotype interaction effect mentioned above).

The saddle point approximation (SPA), run by using test = "Score.SPA", implements the method described by Dey et al. (2017), which was extended to mixed models by Zhou et al. (2018) in their SAIGE software. SPA provides better calibration of p-values when fitting mixed models with a binomial family for a sample with an imbalanced case to control ratio.

The fast approximation to the score standard error for testing variant association used by Zhou et al. (2018) in their SAIGE software is available by setting the fast.score.SE parameter to TRUE. This approximation may be much faster than computing the true score SE in large samples, as it replaces the full covariance matrix in the calculation with the product of a diagonal matrix and a scalar correction factor. This scalar correction factor must be computed beforehand and stored in the input null.model as se.correction, either by fitting the null.model with fitNullModelFastScore, or by updating a null.model previously fit with fitNullModel using the calcScore and nullModelFastScore functions. This approach assumes a constant scalar SE correction factor across all variants. This method is only available when gdsobj is a SeqVarIterator object.

For the GenotypeIterator method, objects created with GdsGenotypeReader or MatrixGenotypeReader are supported. NcdfGenotypeReader objects are not supported.

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

A data.frame where each row refers to a different variant with the columns:

`variant.id`	The variant ID
`chr`	The chromosome value
`pos`	The base pair position
`allele.index`	The index of the alternate allele. For biallelic variants, this will always be 1.
`n.obs`	The number of samples with non-missing genotypes
`freq`	The estimated effect allele frequency
`MAC`	The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the allele specified by `allele.index` with the sum of all other alleles.

If geno.coding is "recessive":

n.hom.eff

The number of samples homozygous for the effect allele.

If geno.coding is "dominant":

n.any.eff

The number of samples with any copies of the effect allele.

If test is "Score":

`Score`	The value of the score function
`Score.SE`	The estimated standard error of the Score
`Score.Stat`	The score Z test statistic
`Score.pval`	The score p-value
`Est`	An approximation of the effect size estimate for each additional copy of the effect allele
`Est.SE`	An approximation of the standard error of the effect size estimate
`PVE`	An approximation of the proportion of phenotype variance explained

If test is "Score.SPA":

`SPA.pval`	The score p-value after applying the saddle point approximation (SPA)
`SPA.converged`	logical indiactor of whether the SPA converged; `NA` indicates that the SPA was not applied and the original Score.pval was returned

If GxE is not NULL:

`Est.G`	The effect size estimate for the genotype term
`Est.G:env`	The effect size estimate for the genotype*env interaction term. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.
`SE.G`	The estimated standard error of the genotype term effect size estimate
`SE.G:env`	The estimated standard error of the genotype*env effect size estimate. There will be as many of these terms as there are interaction variables, and "env" will be replaced with the variable name.
`GxE.Stat`	The Wald Z test statistic for the test of all genotype interaction terms. When there is only one genotype interaction term, this is the test statistic for that term.
`GxE.pval`	The Wald p-value for the test of all genotype interaction terms; i.e. the test of any genotype interaction effect
`Joint.Stat`	The Wald Z test statistic for the joint test of the genotype term and all of the genotype interaction terms
`Joint.pval`	The Wald p-value for the joint test of the genotype term and all of the genotype interaction terms; i.e. the test of any genotype effect

If test is "BinomiRare" or "CMP":

`n.carrier`	Number of individuals with at least one copy of the effect allele
`n.D.carrier`	Number of cases with at least one copy of the effect allele
`pval`	p-value
`mid.pval`	mid-p-value

Author(s)

Matthew P. Conomos, Stephanie M. Gogarten, Tamar Sofer, Ken Rice, Chaoyu Yu

References

Dey, R., Schmidt, E. M., Abecasis, G. R., & Lee, S. (2017). A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. The American Journal of Human Genetics, 101(1), 37-49.

Sofer, T. (2017). BinomiRare: A robust test of the association of a rare variant with a disease for pooled analysis and meta-analysis, with application to the HCHS/SOL. Genetic Epidemiology, 41(5), 388-395.

Sofer, T. & Guo, N. (2020). Rare variants association testing for a binary outcome when pooling individual level data from heterogeneous studies. https://www.biorxiv.org/content/10.1101/2020.04.17.047530v1.

Zhou, W., Nielsen, J. B., Fritsche, L. G., Dey, R., Gabrielsen, M. E., Wolford, B. N., ... & Bastarache, L. A. (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature genetics, 50(9), 1335.

Examples

library(SeqVarTools)
library(Biobase)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarIterator object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))
iterator <- SeqVarBlockIterator(seqData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome", covars="sex")

# run the association test
assoc <- assocTestSingle(iterator, nullmod,
                         BPPARAM=BiocParallel::SerialParam())

# use fast score SE for a null model with a covariance matrix
seqResetFilter(seqData)
grm <- SNPRelate::snpgdsGRM(seqData, verbose=FALSE)
covmat <- grm$grm; dimnames(covmat) <- list(grm$sample.id, grm$sample.id)
set.seed(5)
nullmod <- fitNullModelFastScore(iterator, outcome="outcome", covars="sex", cov.mat=covmat)
assoc.se <- assocTestSingle(iterator, nullmod, fast.score.SE=TRUE,
                            BPPARAM=BiocParallel::SerialParam())

seqClose(iterator)


library(GWASTools)

# open a SNP-based GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
gds <- GdsGenotypeReader(filename = gdsfile)

# simulate some phenotype data
set.seed(4)
pheno <- data.frame(scanID=getScanID(gds),
                    outcome=rnorm(nscan(gds)))

# construct a GenotypeIterator object
genoData <- GenotypeData(gds, scanAnnot=ScanAnnotationDataFrame(pheno))
iterator <- GenotypeBlockIterator(genoData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome")

# run the association test
assoc <- assocTestSingle(iterator, nullmod,
                         BPPARAM=BiocParallel::SerialParam())

close(iterator)
library(SeqVarTools)
library(Biobase)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarIterator object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))
iterator <- SeqVarBlockIterator(seqData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome", covars="sex")

# run the association test
assoc <- assocTestSingle(iterator, nullmod,
                         BPPARAM=BiocParallel::SerialParam())

# use fast score SE for a null model with a covariance matrix
seqResetFilter(seqData)
grm <- SNPRelate::snpgdsGRM(seqData, verbose=FALSE)
covmat <- grm$grm; dimnames(covmat) <- list(grm$sample.id, grm$sample.id)
set.seed(5)
nullmod <- fitNullModelFastScore(iterator, outcome="outcome", covars="sex", cov.mat=covmat)
assoc.se <- assocTestSingle(iterator, nullmod, fast.score.SE=TRUE,
                            BPPARAM=BiocParallel::SerialParam())

seqClose(iterator)


library(GWASTools)

# open a SNP-based GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
gds <- GdsGenotypeReader(filename = gdsfile)

# simulate some phenotype data
set.seed(4)
pheno <- data.frame(scanID=getScanID(gds),
                    outcome=rnorm(nscan(gds)))

# construct a GenotypeIterator object
genoData <- GenotypeData(gds, scanAnnot=ScanAnnotationDataFrame(pheno))
iterator <- GenotypeBlockIterator(genoData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome")

# run the association test
assoc <- assocTestSingle(iterator, nullmod,
                         BPPARAM=BiocParallel::SerialParam())

close(iterator)

Computes variant-specific inflation factors

Description

Computes variant-specific inflation factors resulting from differences in variances and allele frequencies across groups pooled together in analysis.

Usage

computeVSIF(freq, n, sigma.sq)
computeVSIFNullModel(null.model, freq, group.var.vec)
computeVSIF(freq, n, sigma.sq)
computeVSIFNullModel(null.model, freq, group.var.vec)

Arguments

`freq`	A named vector or a matrix of effect allele frequencies across groups. Vector/column names are group names; rows (for a matrix) are variants.
`n`	A named vector of group sample sizes.
`sigma.sq`	A named vector of residual variances across groups.
`null.model`	A null model constructed with `fitNullModel`.
`group.var.vec`	A named vector of group membership. Names are sample.ids, values are group names.

Details

computeVSIF computes the expected inflation/deflation for each specific variant caused by differences in allele frequencies in combination with differences in residual variances across groups that are aggregated together (e.g. individuals with different genetic ancestry patterns). The inflation/deflation is especially expected if a homogeneous variance model is used.

computeVSIFNullModel uses the null model and vector of group membership to extract sample sizes and residual variances for each group. It then calls function computeVSIF to compute the inflation factors. The null model should be fit under a homogeneous variance model.

Value

`SE_true`	Large sample test statistic variances accounting for differences in residual variances.
`SE_naive`	Large sample test statistic variances (wrongly) assuming that all residual variances are the same across groups.
`Inflation_factor`	Variant-specific inflation factors. Values higher than 1 suggest inflation (too significant p-value), values lower than 1 suggest deflation (too high p-value).

Author(s)

Tamar Sofer, Kenneth Rice

References

Sofer, T., Zheng, X., Laurie, C. A., Gogarten, S. M., Brody, J. A., Conomos, M. P., ... & Rice, K. M. (2020). Population Stratification at the Phenotypic Variance level and Implication for the Analysis of Whole Genome Sequencing Data from Multiple Studies. BioRxiv.

Examples

  n <- c(2000, 5000, 100)
  sigma.sq <- c(1, 1, 2)
  freq.vec <- c(0.1, 0.2, 0.5)
  names(freq.vec) <- names(n) <- names(sigma.sq) <- c("g1", "g2", "g3") 
  res <- computeVSIF(freq = freq.vec, n, sigma.sq)
  
  freq.mat <- matrix(c(0.1, 0.2, 0.5, 0.1, 0.01, 0.5), nrow = 2, byrow = TRUE)
  colnames(freq.mat) <- names(sigma.sq)
  res <- computeVSIF(freq = freq.mat, n, sigma.sq)


  library(GWASTools)
  n <- 1000
  set.seed(22)
  outcome <- c(rnorm(n*0.28, sd =1), rnorm(n*0.7, sd = 1), rnorm(n*0.02, sd = sqrt(2)) )
  dat <- data.frame(sample.id=paste0("ID_", 1:n),
                    outcome = outcome,
                    b=c(rep("g1", n*0.28), rep("g2", n*0.7), rep("g3", n*0.02)),
                    stringsAsFactors=FALSE)
  dat <- AnnotatedDataFrame(dat)
  nm <- fitNullModel(dat, outcome="outcome", covars="b", verbose=FALSE)
  freq.vec <- c(0.1, 0.2, 0.5)
  names(freq.vec) <- c("g1", "g2", "g3") 
  group.var.vec <- dat$b
  names(group.var.vec) <- dat$sample.id
  
  res <- computeVSIFNullModel(nm, freq.vec, group.var.vec)
  
  freq.mat <- matrix(c(0.1, 0.2, 0.5, 0.1, 0.01, 0.5), nrow = 2, byrow = TRUE)
  colnames(freq.mat) <- c("g1", "g2", "g3")
  res <- computeVSIFNullModel(nm, freq.mat, group.var.vec)
n <- c(2000, 5000, 100)
  sigma.sq <- c(1, 1, 2)
  freq.vec <- c(0.1, 0.2, 0.5)
  names(freq.vec) <- names(n) <- names(sigma.sq) <- c("g1", "g2", "g3") 
  res <- computeVSIF(freq = freq.vec, n, sigma.sq)
  
  freq.mat <- matrix(c(0.1, 0.2, 0.5, 0.1, 0.01, 0.5), nrow = 2, byrow = TRUE)
  colnames(freq.mat) <- names(sigma.sq)
  res <- computeVSIF(freq = freq.mat, n, sigma.sq)


  library(GWASTools)
  n <- 1000
  set.seed(22)
  outcome <- c(rnorm(n*0.28, sd =1), rnorm(n*0.7, sd = 1), rnorm(n*0.02, sd = sqrt(2)) )
  dat <- data.frame(sample.id=paste0("ID_", 1:n),
                    outcome = outcome,
                    b=c(rep("g1", n*0.28), rep("g2", n*0.7), rep("g3", n*0.02)),
                    stringsAsFactors=FALSE)
  dat <- AnnotatedDataFrame(dat)
  nm <- fitNullModel(dat, outcome="outcome", covars="b", verbose=FALSE)
  freq.vec <- c(0.1, 0.2, 0.5)
  names(freq.vec) <- c("g1", "g2", "g3") 
  group.var.vec <- dat$b
  names(group.var.vec) <- dat$sample.id
  
  res <- computeVSIFNullModel(nm, freq.vec, group.var.vec)
  
  freq.mat <- matrix(c(0.1, 0.2, 0.5, 0.1, 0.01, 0.5), nrow = 2, byrow = TRUE)
  colnames(freq.mat) <- c("g1", "g2", "g3")
  res <- computeVSIFNullModel(nm, freq.mat, group.var.vec)

Return the effect allele for association testing

Description

effectAllele returns the effect allele for association testing.

Usage

## S4 method for signature 'SeqVarGDSClass'
effectAllele(gdsobj, variant.id=NULL)
## S4 method for signature 'GenotypeData'
effectAllele(gdsobj, variant.id=NULL)
## S4 method for signature 'SeqVarGDSClass'
effectAllele(gdsobj, variant.id=NULL)
## S4 method for signature 'GenotypeData'
effectAllele(gdsobj, variant.id=NULL)

Arguments

`gdsobj`	An object of class `SeqVarIterator` from the package SeqVarTools, or an object of class `GenotypeIterator` from the package GWASTools, containing the genotype data for the variants and samples to be used for the analysis.
`variant.id`	A vector of identifiers for variants to return.

Details

effectAllele returns the effect allele corresponding to association test results from assocTestSingle or assocTestAggregate. variant.id allows the user to specify for which variants effect alleles should be returned.

Value

A data.frame with the following columns:

`variant.id`	The variant ID
`effect.allele`	The character value for the effect allele
`other.allele`	The character value for the other (non-effect) allele

Author(s)

Stephanie M. Gogarten

Examples

library(SeqVarTools)
library(Biobase)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarIterator object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))
iterator <- SeqVarBlockIterator(seqData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome", covars="sex")

# run the association test
assoc <- assocTestSingle(iterator, nullmod)

# add effect allele to the results
eff <- effectAllele(seqData, variant.id=assoc$variant.id)
assoc <- dplyr::left_join(assoc,eff)
head(assoc)

seqClose(iterator)
library(SeqVarTools)
library(Biobase)

# open a sequencing GDS file
gdsfile <- seqExampleFileName("gds")
gds <- seqOpen(gdsfile)

# simulate some phenotype data
set.seed(4)
data(pedigree)
pedigree <- pedigree[match(seqGetData(gds, "sample.id"), pedigree$sample.id),]
pedigree$outcome <- rnorm(nrow(pedigree))

# construct a SeqVarIterator object
seqData <- SeqVarData(gds, sampleData=AnnotatedDataFrame(pedigree))
iterator <- SeqVarBlockIterator(seqData)

# fit the null model
nullmod <- fitNullModel(iterator, outcome="outcome", covars="sex")

# run the association test
assoc <- assocTestSingle(iterator, nullmod)

# add effect allele to the results
eff <- effectAllele(seqData, variant.id=assoc$variant.id)
assoc <- dplyr::left_join(assoc,eff)
head(assoc)

seqClose(iterator)

Fit a Model Under the Null Hypothesis

Description

fitNullModel fits a regression model or a mixed model with random effects specified by their covariance structures; this allows for the inclusion of a polygenic random effect using a kinship matrix or genetic relationship matrix (GRM). The output of fitNullModel can be used to estimate genetic heritability and can be passed to assocTestSingle or assocTestAggregate for the purpose of genetic association testing.

nullModelInvNorm does an inverse normal transform of a previously fit null model.

nullModelSmall returns a small version of the null model with no NxN matrices.

isNullModelSmall returns TRUE if a null model is small; FALSE otherwise.

fitNullModelFastScore fits a null model that can be used for association testing with the fast approximation to the score standard error (SE).

calcScore calculates the score, its true SE, and the fast SE for a set of variants; used to compute the SE correction factor used for the fast approximation.

nullModelFastScore updates a previously fit null model so that it can be used for association testing with the fast approximation to the score SE.

isNullModelFastScore returns TRUE if a null model can be used for association testing with the fast approximation to the score SE; FALSE otherwise.

Usage

## S4 method for signature 'data.frame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)
## S4 method for signature 'AnnotatedDataFrame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, sample.id = NULL, ...)
## S4 method for signature 'SeqVarData'
fitNullModel(x, ...)
## S4 method for signature 'ScanAnnotationDataFrame'
fitNullModel(x, ...)
## S4 method for signature 'GenotypeData'
fitNullModel(x, ...)

nullModelInvNorm(null.model, cov.mat = NULL,
                 norm.option = c("all", "by.group"),
                 rescale = c("residSD", "none", "model"),
                 AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
                 drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)

nullModelSmall(null.model)

isNullModelSmall(null.model)

## S4 method for signature 'SeqVarData'
fitNullModelFastScore(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = TRUE,
            variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
            imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
            verbose = TRUE)

calcScore(x, null.model, variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
          imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
          verbose = TRUE)

nullModelFastScore(null.model, score.table, return.small = TRUE, verbose = TRUE)

isNullModelFastScore(null.model)
## S4 method for signature 'data.frame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)
## S4 method for signature 'AnnotatedDataFrame'
fitNullModel(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, sample.id = NULL, ...)
## S4 method for signature 'SeqVarData'
fitNullModel(x, ...)
## S4 method for signature 'ScanAnnotationDataFrame'
fitNullModel(x, ...)
## S4 method for signature 'GenotypeData'
fitNullModel(x, ...)

nullModelInvNorm(null.model, cov.mat = NULL,
                 norm.option = c("all", "by.group"),
                 rescale = c("residSD", "none", "model"),
                 AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
                 drop.zeros = TRUE, return.small = FALSE, verbose = TRUE)

nullModelSmall(null.model)

isNullModelSmall(null.model)

## S4 method for signature 'SeqVarData'
fitNullModelFastScore(x, outcome, covars = NULL, cov.mat = NULL,
            group.var = NULL, family = "gaussian", two.stage = FALSE,
            norm.option = c("all", "by.group"), rescale = c("residSD", "none", "model"),
            start = NULL, AIREML.tol = 1e-4, max.iter = 100, EM.iter = 0,
            drop.zeros = TRUE, return.small = TRUE,
            variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
            imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
            verbose = TRUE)

calcScore(x, null.model, variant.id = NULL, nvar = 100, min.mac = 20, sparse = TRUE,
          imputed = FALSE, male.diploid = TRUE, genome.build = c("hg19", "hg38"),
          verbose = TRUE)

nullModelFastScore(null.model, score.table, return.small = TRUE, verbose = TRUE)

isNullModelFastScore(null.model)

Arguments

`x`	An object of class `data.frame`, `AnnotatedDataFrame`, or `SeqVarData` containing the outcome and covariate data for the samples to be used for the analysis. Must be class `SeqVarData` when using `fitNullModelFastScore` or `calcScore`. See 'Details' for more information.
`outcome`	A character string specifying the name of the outcome variable in `x`.
`covars`	A vector of character strings specifying the names of the fixed effect covariates in `x`; an intercept term is automatically included. If `NULL` (default), the only fixed effect covariate is the intercept term.
`cov.mat`	A matrix or list of matrices specifying the covariance structures of the random effects terms. Objects from the Matrix package are supported. If `NULL` (default), no random effects terms are included. See 'Details' for more information.
`group.var`	This variable can only be used when `family = "gaussian"`. A character string specifying the name of a categorical variable in `x` that is used to fit heterogeneous residual error variances. If `NULL` (default), then a standard LMM with constant residual variance for all samples is fit. See 'Details' for more information.
`sample.id`	A vector of IDs for samples to include in the analysis. If `NULL`, all samples in `x` are included. This argument is ignored if `x` is a data.frame; see 'Details'.
`family`	A description of the error distribution to be used in the model. The default `"gaussian"` fits a linear model; `"binomial"` and `"poisson"` are also supported. See 'Details' for more information.
`two.stage`	Logical indicator of whether to use a fully-adjusted two-stage rank normalization procedure for fitting the model. Can only be used when `family = "gaussian"`. See 'Details' for more information.
`norm.option`	Specifies whether the rank normalization should be done separately within each value of `group.var` (`"by.group"`) or with all samples together (`"all"`) when using `two.stage` or `nullModelInvNorm.`
`rescale`	Specifies how to rescale the residuals after rank normalization when using `two.stage` or `nullModelInvNorm.`: `"none"` for no rescaling of the residuals; `"model"` to rescale by the model-based variance components, and `"residSD"` (default) to rescale by the standard deviation of the original marginal residuals. Rescaling is done with the same grouping as the rank normalization, as specified by `norm.option`.
`start`	A vector of starting values for the variance component estimation procedure. The function will pick reasonable starting values when left `NULL` (default). See 'Details' for more information.
`AIREML.tol`	The convergence threshold for the Average Information REML (AIREML) procedure used to estimate the variance components of the random effects. See 'Details' for more information.
`max.iter`	The maximum number of iterations allowed to reach convergence.
`EM.iter`	The number of EM iterations to run prior to AIREML; default is 0.
`drop.zeros`	Logical indicator of whether variance component terms that converge to 0 should be removed from the model; the default is `TRUE`. See 'Details' for more information.
`return.small`	Logical for whether to return a small version of the null model without NxN matrices. Default for `fitNullModel` is `FALSE`; only set to TRUE for use in association tests with `test = "BinomiRare"` or `test = "CMP"` and `recalc.pval.thresh = 1`. Default for `fitNullModelFastScore` is `TRUE`, as NxN matrices are not needed for the fast score SE approximation.
`null.model`	The output of `fitNullModel`.
`variant.id`	Optional list of variant.ids in `x` specifying which variants to use for computing the SE correction factor; if `NULL`, a random selection of `nvar` variants with minor allele count at least `min.mac` is used.
`nvar`	The number of random variants to select from `x` for computing the SE correction factor; ignored if `variant.id` is specified.
`min.mac`	The minimum minor allele count allowed for the random variants selected from `x` for computing the SE correction factor; ignored if `variant.id` is specified.
`sparse`	Logical indicator of whether to read genotypes as sparse Matrix objects; the default is `TRUE`. Set this to `FALSE` if the alternate allele dosage of the genotypes in the test are not expected to be mostly 0.
`imputed`	Logical indicator of whether to read dosages from the "DS" field containing imputed dosages instead of counting the number of alternate alleles.
`male.diploid`	Logical for whether males on sex chromosomes are coded as diploid.
`genome.build`	A character sting indicating genome build; used to identify pseudoautosomal regions on the X and Y chromosomes.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`...`	Arguments to pass to other methods.
`score.table`	The output of `calcScore`.

Details

If x is a data.frame, the rownames of x must match the row and column names of cov.mat (if cov.mat is specified). If x is an AnnotatedDataFrame or other object containing an AnnotatedDataFrame, x will be re-ordered (if necessary) so that sample.id or scanID is in the same order as the row and column names of cov.mat.

If any covariates have the same value for all samples, they will be dropped from the model with a message. Note that the 'model' and 'covars' element in the output object will still include that covariate.

The code checks for multicollinearity of covariates by checking that the rank of the design matrix is equal to the number of columns; if the rank is smaller, it fails with an error.

cov.mat is used to specify the covariance structures of the random effects terms in the model. For example, to include a polygenic random effect, one matrix in cov.mat could be a kinship matrix or a genetic relationship matrix (GRM). As another example, to include household membership as a random effect, one matrix in cov.mat should be a 0/1 matrix with a 1 in the [i,j] and [j,i] entries if individuals i and j are in the same household and 0 otherwise; the diagonals of such a matrix should all be 1. If cov.mat is a list without names, the code will assign sequential letters as names. If some elements are named but not others, it will produce an error.

For some outcomes, there may be evidence that different groups of observations have different residual variances, and the standard LMM assumption of homoscedasticity is violated. When group.var is specified, separate (heterogeneous) residual variance components are fit for each unique value of group.var. This parameter is only applicable when family = "gaussian".

When family is not gaussian, the penalized quasi-likelihood (PQL) approximation to the generalized linear mixed model (GLMM) is fit following the procedure of GMMAT (Chen et al., 2016).

Let m be the number of matrices in cov.mat and let g be the number of categories in the variable specified by group.var. The length of the start vector must be (m + 1) when family is gaussian and group.var is NULL; (m + g) when family is gaussian and group.var is specified; or m when family is not gaussian.

A Newton-Raphson iterative procedure with Average Information REML (AIREML) is used to estimate the variance components of the random effects. When the absolute change between all of the new and previous variance component estimates is less than var(outcome)*AIREML.tol, the algorithm declares convergence of the estimates. Sometimes a variance component may approach the boundary of the parameter space at 0; step-halving is used to prevent any component from becomming negative. However, when a variance component gets near the 0 boundary, the algorithm can sometimes get "stuck", preventing the other variance components from converging; if drop.zeros is TRUE, then variance components that converge to a value less than AIREML.tol will be dropped from the model and the estimation procedure will continue with the remaining variance components.

When two.stage = TRUE, the fully-adjusted two-stage rank normalization procedure from Sofer et. al. (2019) is used. With this procedure: the stage 1 model is fit as usual, with the specified outcome, covars, cov.mat, and group.var; the marginal residuals from the stage 1 model are rank normalized as specified by norm.option and then rescaled as specified by rescale; the stage 2 model is then fit using the rank normalized and rescaled residuals as the new outcome, with the same covars, cov.mat, and group.var as the stage 1 model. The output of this stage 2 model is saved and should be used for association testing. This procedure is only applicable when family = "gaussian". The nullModelInvNorm function takes a previously fit null model as input and does the rank normalization, rescaling, and stage 2 model fitting.

The function fitNullModelFastScore fits a null model that can be used for testing variant association with the fast approximation to the score standard error (SE) implemented by Zhou et al. (2018) in their SAIGE software. This approximation may be much faster than computing the true score SE in large samples, as it replaces the full covariance matrix in the calculation with the product of a diagonal matrix and a scalar correction factor (se.correction in the null model output); see the option fast.Score.SE in assocTestSingle for further details. A null model previously fit with fitNullModel can be updated to this format by using calcScore to compute the necessary statistics on a set of variants, followed by nullModelFastScore to calculate the se.correction factor and update the null model format.

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

An object of class 'GENESIS.nullModel' or 'GENESIS.nullMixedModel'. A list including:

model: A list with elements describing the model that was fit. See below for explanations of the elements in this list.
varComp: The variance component estimates. There is one variance component for each random effect specified in cov.mat. When family is gaussian, there are additional residual variance components; one residual variance component when group.var is NULL, and as many residual variance components as there are unique values of group.var when it is specified.
varCompCov: The estimated covariance matrix of the variance component estimates given by varComp. This can be used for hypothesis tests regarding the variance components.
fixef: A data.frame with effect size estimates (betas), standard errors, chi-squared test statistics, and p-values for each of the fixed effect covariates specified in covars.
betaCov: The estimated covariance matrix of the effect size estimates (betas) of the fixed effect covariates. This can be used for hypothesis tests regarding the fixed effects.
fit: A data frame with the outcome, fitted values, and residuals. See below for explanations of the columns of this data frame.
logLik: The log-likelihood value.
logLikR: The restricted log-likelihood value.
AIC: The Akaike Information Criterion value.
model.matrix: The design matrix for the fixed effect covariates used in the model.
group.idx: If group.var is not NULL, a list of indices for samples in each group.
cholSigmaInv: The Cholesky decomposition of the inverse of the estimated outcome covariance structure. This is used by assocTestSingle or assocTestAggregate for genetic association testing.
W: The diagonal weight matrix with terms given by the variance function for the specified family. This is the inverse of portion of the estimated outcome covariance structure not attributed to random effects specified with cov.mat. This matrix is used in place of the inverse of the estimated outcome covariance structure when using fast.score.SE for association testing with assocTestSingle.
converged: A logical indicator of whether the AIREML procedure for estimating the random effects variance components converged.
zeroFLAG: A vector of logicals the same length as varComp specifying whether the corresponding variance component estimate was set to 0 by the function due to convergence to the boundary in the AIREML procedure.
RSS: The residual sum of squares from the model fit. When family is gaussian, this will typically be 1 since the residual variance component is estimated separately.
RSS0: the sum of resid.cholesky squared. This is the sum of squared residuals under the null hypothesis of no genetic effect for the covariate and random effect adjusted model using the Frisch-Waugh-Lovell theorem.
CX: a matrix needed for calculating association test statistics
CXCXI: a matrix needed for calculating association test statistics
se.correction: The scalar correction factor for the fast approximation to the score SE; the average of the se.ratio values in score.table.
score.table: A data frame with information about the variants used to compute the se.correction factor.

The fit data frame contains the following columns, depending on which type of model was fit:

outcome: The original outcome vector.
workingY: The "working" outcome vector. When family is gaussian, this is just the original outcome vector. When family is not gaussian, this is the PQL linearization of the outcome vector. This is used by assocTestSingle or assocTestAggregate for genetic association testing. See 'Details' for more information.
fitted.values: The fitted values from the model. For mixed models, this is X*beta where X is the design matrix and beta is the vector of effect size estimates for the fixed effects. For non-mixed models, this is the inverse link function applied to X*beta (e.g., expit(X*beta) for logistic regression when family = "binomial").
resid.marginal: The marginal residuals from the model; i.e. Y - X*beta where Y is the vector of outcome values.
linear.predictor: The linear predictor from the model; i.e. X*beta + Z*u, where Z*u represents the effects captured by the random effects specified with the cov.mat input.
resid.conditional: The conditional residuals from the model; i.e. Y - X*beta - Z*u, where Z*u represents the effects captured by the random effects specified with the cov.mat input.
resid.cholesky: The Cholesky residuals from the model; a transformation of the marginal residuals using the estimated model covariance structure such that they are uncorrelated and follow a standard multivariate Normal distribution with mean 0 and identity covariance matrix asymptotically. Linear regression of the Cholesky residual vector on an equivalently transformed genotype vector provides the same estimates as fitting the full GLS model (by the Frisch-Waugh-Lovell theorem).
resid.PY: The outcome vector (Y) pre-multiplied by a projection matrix (P) that adjusts for covariates, random effects, and model covariance structure. These projected outcome values are essentially what are correlated with genotype values for association testing.
sample.id: A vector of IDs for the samples used in the analysis, if called with an AnnotatedDataFrame.

The model list element contains the following elements:

outcome: The outcome variable name.
covars: A vector of covariate names
formula: The model string.
family: A character string specifying the family used in the analysis.
hetResid: A logical indicator of whether heterogeneous residual variance components were used in the model (specified by group.var).

The score.table data frame contains the following columns:

variant.id: The variant ID
chr: The chromosome value
pos: The base pair position
allele.index: The index of the alternate allele. For biallelic variants, this will always be 1.
n.obs: The number of samples with non-missing genotypes
freq: The estimated alternate allele frequency
MAC: The minor allele count. For multiallelic variants, "minor" is determined by comparing the count of the alternate allele specified by allele.index with the sum of all other alleles.
Score: The value of the score function
Score.SE: The estimated true standard error of the Score
Score.SE.fast: The estimated fast standard error of the Score (before scalar correction)
se.ratio: The ratio of Score.SE to Score.SE.fast; these values are averaged across varaints to estimate se.correction in nullModelFastScore.

Author(s)

Matthew P. Conomos, Stephanie M. Gogarten, Tamar Sofer, Ken Rice, Chaoyu Yu

References

Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedon JC, Redline S, Papanicolaou GJ, Thornton TA, Laurie CC, Rice K and Lin X. (2016) Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies Using Logistic Mixed Models. American Journal of Human Genetics, 98(4):653-66.

Sofer, T., Zheng, X., Gogarten, S. M., Laurie, C. A., Grinde, K., Shaffer, J. R., ... & Rice, K. M. (2019). A fully adjusted two-stage procedure for rank-normalization in genetic association studies. Genetic epidemiology, 43(3), 263-275.

Breslow NE and Clayton DG. (1993). Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 88: 9-25.

Gilmour, A.R., Thompson, R., & Cullis, B.R. (1995). Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 1440-1450.

Examples

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat,
                divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
    			training.set = mypcair$unrels,
                        BPPARAM = BiocParallel::SerialParam())
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id,
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)
library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat,
                divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
    			training.set = mypcair$unrels,
                        BPPARAM = BiocParallel::SerialParam())
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id,
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)

Defunct functions in package GENESIS

Description

These functions are defunct and no longer available.

Details

The following functions are defunct; use the replacement indicated below:

admixMapMM: admixMap
assocTestMM: assocTestSingle
assocTestSeq: assocTestAggregate
assocTestSeqWindow: assocTestAggregate
fitNullMM: fitNullModel
fitNullReg: fitNullModel
king2mat: kingToMatrix
pcrelate,GenotypeData-method: pcrelate,GenotypeIterator-method
pcrelate,SeqVarData-method: pcrelate,SeqVarIterator-method
pcrelateMakeGRM: pcrelateToMatrix
pcrelateReadInbreed: pcrelate
pcrelateReadKinship: pcrelate

Matrix of Pairwise Kinship Coefficient Estimates for the combined HapMap ASW and MXL Sample found with the KING-robust estimator from the KING software.

Description

KING-robust kinship coefficient estimates for the combined HapMap African Americans in the Southwest U.S. (ASW) and Mexican Americans in Los Angeles (MXL) samples.

Usage

data(HapMap_ASW_MXL_KINGmat)data(HapMap_ASW_MXL_KINGmat)

Format

The format is: num [1:173, 1:173] 0 0.00157 -0.00417 0.00209 0.00172 ...

Value

A matrix of pairwise kinship coefficient estimates as calculated with KING-robust for the combined HapMap African Americans in the Southwest U.S. (ASW) and Mexican Americans in Los Angeles (MXL) samples.

Source

http://hapmap.ncbi.nlm.nih.gov/

References

International HapMap 3 Consortium. (2010). Integrating common and rare genetic variation in diverse human populations. Nature, 467(7311), 52-58.

Perform a joint score test

Description

jointScoreTest is used to perform a joint score test of a set of variants using a null model and a matrix of genotype dosages.

Usage

jointScoreTest(null.model, G)
jointScoreTest(null.model, G)

Arguments

`null.model`	A null model object returned by `fitNullModel`.
`G`	A matrix of genotype dosages, where samples are the rows and variants are the columns.

Details

jointScoreTest takes the object returned by fitNullModel and performs a joint score test for all variants in the G matrix using this null model. All effect size and PVE estimates in fixef are adjusted for (i.e. conditional on) all other variants included in G.

The G matrix must be formatted such that the rows are samples and the columns are variants, and the entries are the dosage for that sample and variant. The matrix must have sample.id as rownames; this is used to match the genotypes to the null model. Therefore, the same sample identifiers must be used in both null.model and G. G can contain additional samples and ordering is unimportant as long as all samples from the null model are present; it will be subset and reordered to match the set of samples in the null model.

Colnames for G are not required. If present, the column names of G will be used as the rownames of the fixef element and the column and rownames of the betaCov element of the output. fixef and betaCov will be in the same order as the columns in G.

Missing data in G are not allowed.

p-values that are calculated using pchisq and are smaller than .Machine\$double.xmin are set to .Machine\$double.xmin.

Value

jointScoreTest returns a list with the following elements:

`Joint.Stat`	Squared joint score test statistic for all variants in `G`.
`Joint.pval`	p-value associated with the joint score test statistic drawn from a $\chi^2$ distribution with $n_\mathrm{variants}$ degrees of freedom.
`Joint.PVE`	Estimate of the proportion of variance explained jointly by the variants in `G`.
`fixef`	A data.frame with joint effect size estimates (Est), standard errors (SE), chi-squared test statistics (Stat), p-values (pval), and estimated proportion of variance explained (PVE) for each of the variants specified in `G`.
`betaCov`	Estimated covariance matrix for the variants in `G`.

Author(s)

Adrienne M. Stilp, Matthew P. Conomos

Examples

library(GWASTools)

# File path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# Read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# Create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)

# Prepare genotypes for the first five SNPs and the first 20 samples.
# Transpose it so that samples are rows and SNPs are columns.
geno <- t(getGenotype(HapMap_genoData, snp = c(1, 5), scan = c(1, 20)))
# Set row and column names.
rownames(geno) <- as.character(GWASTools::getScanID(HapMap_genoData))[1:20]
colnames(geno) <- sprintf("snp%s", 1:5)

# Create a phenotype
set.seed(5)
phen <- data.frame(
  outcome = rnorm(1:20),
  covar = rnorm(1:20),
  stringsAsFactors = FALSE
)
rownames(phen) <- rownames(geno)

# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
cov.mat = HapMap_ASW_MXL_KINGmat[rownames(phen), rownames(phen)]

# Fit a null model.
nullmod <- fitNullModel(phen,
                        outcome = "outcome",
                        covars = "covar",
                        cov.mat = cov.mat)


# Run the joint score test.
jointScoreTest(nullmod, geno)

close(HapMap_genoData)

library(GWASTools)

# File path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# Read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# Create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)

# Prepare genotypes for the first five SNPs and the first 20 samples.
# Transpose it so that samples are rows and SNPs are columns.
geno <- t(getGenotype(HapMap_genoData, snp = c(1, 5), scan = c(1, 20)))
# Set row and column names.
rownames(geno) <- as.character(GWASTools::getScanID(HapMap_genoData))[1:20]
colnames(geno) <- sprintf("snp%s", 1:5)

# Create a phenotype
set.seed(5)
phen <- data.frame(
  outcome = rnorm(1:20),
  covar = rnorm(1:20),
  stringsAsFactors = FALSE
)
rownames(phen) <- rownames(geno)

# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
cov.mat = HapMap_ASW_MXL_KINGmat[rownames(phen), rownames(phen)]

# Fit a null model.
nullmod <- fitNullModel(phen,
                        outcome = "outcome",
                        covars = "covar",
                        cov.mat = cov.mat)


# Run the joint score test.
jointScoreTest(nullmod, geno)

close(HapMap_genoData)

Store kinship matrix in GDS

Description

kin2gds and mat2gds write kinship matrices to GDS files.

Usage

kin2gds(ibdobj, gdsfile)
mat2gds(mat, gdsfile)
kin2gds(ibdobj, gdsfile)
mat2gds(mat, gdsfile)

Arguments

`ibdobj`	A list with elements `sample.id` and `kinship`, such as the output of `snpgdsIBDKING`.
`mat`	A matrix with sample identifiers in colnames.
`gdsfile`	A character string with the name of the GDS file to create.

Details

When using pcair or pcairPartition with large sample sizes, it can be more memory efficient to read data from GDS files. kin2gds and mat2gds store kinship matrices in GDS files for use with these functions.

Author(s)

Stephanie M. Gogarten

Examples

library(gdsfmt)

# KING results from the command-line program
file.kin0 <- system.file("extdata", "MXL_ASW.kin0", package="GENESIS")
file.kin <- system.file("extdata", "MXL_ASW.kin", package="GENESIS")
KINGmat <- kingToMatrix(c(file.kin0, file.kin), estimator="Kinship")
gdsfile <- tempfile()
mat2gds(KINGmat, gdsfile)
gds <- openfn.gds(gdsfile)
gds
closefn.gds(gds)

# KING results from SNPRelate
library(SNPRelate)
geno <- snpgdsOpen(snpgdsExampleFileName())
king <- snpgdsIBDKING(geno)
closefn.gds(geno)
kin2gds(king, gdsfile)
gds <- openfn.gds(gdsfile)
gds
closefn.gds(gds)
library(gdsfmt)

# KING results from the command-line program
file.kin0 <- system.file("extdata", "MXL_ASW.kin0", package="GENESIS")
file.kin <- system.file("extdata", "MXL_ASW.kin", package="GENESIS")
KINGmat <- kingToMatrix(c(file.kin0, file.kin), estimator="Kinship")
gdsfile <- tempfile()
mat2gds(KINGmat, gdsfile)
gds <- openfn.gds(gdsfile)
gds
closefn.gds(gds)

# KING results from SNPRelate
library(SNPRelate)
geno <- snpgdsOpen(snpgdsExampleFileName())
king <- snpgdsIBDKING(geno)
closefn.gds(geno)
kin2gds(king, gdsfile)
gds <- openfn.gds(gdsfile)
gds
closefn.gds(gds)

Convert KING text output to an R Matrix

Description

kingToMatrix is used to extract the pairwise kinship coefficient estimates from the output text files of KING –ibdseg, KING –kinship, or KING –related and put them into an R object of class Matrix. One use of this matrix is that it can be read by the functions pcair and pcairPartition.

Usage

## S4 method for signature 'character'
kingToMatrix(king, estimator = c("PropIBD", "Kinship"), sample.include = NULL,
    thresh = NULL, verbose = TRUE)
## S4 method for signature 'snpgdsIBDClass'
kingToMatrix(king, sample.include = NULL, thresh = 2^(-11/2), verbose = TRUE)
## S4 method for signature 'character'
kingToMatrix(king, estimator = c("PropIBD", "Kinship"), sample.include = NULL,
    thresh = NULL, verbose = TRUE)
## S4 method for signature 'snpgdsIBDClass'
kingToMatrix(king, sample.include = NULL, thresh = 2^(-11/2), verbose = TRUE)

Arguments

`king`	Output from KING, either a `snpgdsIBDClass` object from `snpgdsIBDKING` or a character vector of one or more file names output from the command-line version of KING; see 'Details'.
`estimator`	Which estimates to read in when using command-line KING output; must be either "PropIBD" or "Kinship"; see 'Details'.
`sample.include`	An optional vector of sample.id indicating all samples that should be included in the output matrix; see 'Details' for usage.
`thresh`	Kinship threshold for clustering samples to make the output matrix sparse block-diagonal. When `NULL`, no clustering is done. See 'Details'.
`verbose`	A logical indicating whether or not to print status updates to the console; the default is TRUE.

Details

king can be a vector of multiple file names if your KING output is stored in multiple files; e.g. KING –kinship run with family IDs returns a .kin and a .kin0 file for pairs within and not within the same family, respectively.

When reading command-line KING output, the estimator argument is required to specify which estimates to read in. When reading KING –ibdseg output, only "PropIBD" will be available; when reading KING –kinship output, only "Kinship" will be available; when reading KING –related output, both "PropIBD" and "Kinship" will be available - use this argument to select which to read. See the KING documentation for details on each estimator.

sample.include has two primary functions: 1) It can be used to subset the KING output. 2) sample.include can include sample.id not in king; this ensures that all samples will be in the output matrix when reading KING –ibdseg output, which likely does not contain all pairs. When left NULL, the function will determine the list of samples from what is observed in king. It is recommended to use sample.include to ensure all of your samples are included in the output matrix.

thresh sets a threhsold for clustering samples such that any pair with an estimated kinship value greater than thresh is in the same cluster. All pairwise estimates within a cluster are kept, even if they are below thresh. All pairwise estimates between clusters are set to 0, creating a sparse, block-diagonal matrix. When thresh is NULL, no clustering is done and all samples are returned in one block. This feature is useful when converting KING –ibdseg or KING –robust estimates to be used as a kinship matrix, if you have a lower threshold that you consider 'related'. This feature should not be used when converting KING –robust estimates to be used as divobj in pcair or pcairPartition, as PC-AiR requires the negative estimates to identify ancestrally divergent pairs.

Value

An object of class 'Matrix' with pairwise kinship coefficients by KING –ibdseg or KING –robust for each pair of individuals in the sample. The estimates are on both the upper and lower triangle of the matrix, and the diagonal is arbitrarily set to 0.5. sample.id are set as the column and row names of the matrix.

Author(s)

Matthew P. Conomos

References

Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

Examples

# KING --kinship
file.king <- c(system.file("extdata", "MXL_ASW.kin0", package="GENESIS"),
               system.file("extdata", "MXL_ASW.kin", package="GENESIS"))
KINGmat <- kingToMatrix(file.king, estimator="Kinship")

# KING --ibdseg
file.king <- system.file("extdata", "HapMap.seg", package="GENESIS")
KINGmat <- kingToMatrix(file.king, estimator="PropIBD")

# SNPRelate
library(SNPRelate)
gds <- snpgdsOpen(system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS"))
king <- snpgdsIBDKING(gds)
KINGmat <- kingToMatrix(king)
snpgdsClose(gds)
# KING --kinship
file.king <- c(system.file("extdata", "MXL_ASW.kin0", package="GENESIS"),
               system.file("extdata", "MXL_ASW.kin", package="GENESIS"))
KINGmat <- kingToMatrix(file.king, estimator="Kinship")

# KING --ibdseg
file.king <- system.file("extdata", "HapMap.seg", package="GENESIS")
KINGmat <- kingToMatrix(file.king, estimator="PropIBD")

# SNPRelate
library(SNPRelate)
gds <- snpgdsOpen(system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS"))
king <- snpgdsIBDKING(gds)
KINGmat <- kingToMatrix(king)
snpgdsClose(gds)

Make a sparse matrix from a dense matrix or a table of pairwise values

Description

makeSparseMatrix is used to create a sparse block-diagonal matrix from a dense matrix or a table of pairwise values, using a user-specified threshold for sparsity. An object of class Matrix will be returned. A sparse matrix may be useful for fitting the association test null model with fitNullModel when working with very large sample sizes.

Usage

## S4 method for signature 'data.table'
makeSparseMatrix(x, thresh = NULL, sample.include = NULL, diag.value = NULL,
                 verbose = TRUE)
## S4 method for signature 'data.frame'
makeSparseMatrix(x, thresh = NULL, sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)
## S4 method for signature 'matrix'
makeSparseMatrix(x, thresh = 2^(-11/2), sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)
## S4 method for signature 'Matrix'
makeSparseMatrix(x, thresh = 2^(-11/2), sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)
## S4 method for signature 'data.table'
makeSparseMatrix(x, thresh = NULL, sample.include = NULL, diag.value = NULL,
                 verbose = TRUE)
## S4 method for signature 'data.frame'
makeSparseMatrix(x, thresh = NULL, sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)
## S4 method for signature 'matrix'
makeSparseMatrix(x, thresh = 2^(-11/2), sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)
## S4 method for signature 'Matrix'
makeSparseMatrix(x, thresh = 2^(-11/2), sample.include = NULL, diag.value = NULL, 
                 verbose = TRUE)

Arguments

`x`	An object to coerce to a sparse matrix. May be of class `matrix`, `Matrix`, `data.frame`, or `data.table`. When a `matrix` or `Matrix`, row and column names should be set to sample.ids; when a data.frame or data.table, should have 3 columns: ID1, ID2, and value.
`thresh`	Value threshold for clustering samples to make the output matrix sparse block-diagonal. When `NULL`, no clustering is done. See 'Details'.
`sample.include`	An optional vector of sample.id indicating all samples that should be included in the output matrix; see 'Details' for usage.
`diag.value`	When `NULL` (by Default), values for the diagonal of the output matrix should be provided in `x`. Setting `diag.value` to a numeric value will make all values on the diagonal of the output matrix that value.
`verbose`	A logical indicating whether or not to print status updates to the console; the default is TRUE.

Details

sample.include has two primary functions: 1) It can be used to subset samples provided in x. 2) sample.include can include sample.id not in x; these additional samples will be included in the output matrix, with a value of 0 for all off-diagonal elements, and the value provided by diag.value for the diagonal elements. When left NULL, the function will determine the list of samples from what is observed in x.

thresh sets a threhsold for clustering samples such that any pair with a value greater than thresh is in the same cluster. All values within a cluster are kept, even if they are below thresh. All values between clusters are set to 0, creating a sparse, block-diagonal matrix. When thresh is NULL, no clustering is done and all samples are returned in one block. This feature is useful when converting a data.frame of kinship estimates to a matrix.

Value

An object of class 'Matrix'. Samples may be in a different order than in the input x, as samples are sorted by ID or rowname within each block (including within the block of unrelateds).

Author(s)

Matthew P. Conomos

PC-AiR: Principal Components Analysis in Related Samples

Description

pcair is used to perform a Principal Components Analysis using genome-wide SNP data for the detection of population structure in a sample. Unlike a standard PCA, PC-AiR accounts for sample relatedness (known or cryptic) to provide accurate ancestry inference that is not confounded by family structure.

Usage

## S4 method for signature 'gds.class'
pcair(gdsobj, kinobj = NULL, divobj = NULL,
                            kin.thresh = 2^(-11/2), div.thresh = -2^(-11/2),
                            unrel.set = NULL, 
                            sample.include = NULL, snp.include = NULL,
                            num.cores = 1L, verbose = TRUE, ...)
## S4 method for signature 'SNPGDSFileClass'
pcair(gdsobj, ...)
## S4 method for signature 'GdsGenotypeReader'
pcair(gdsobj, ...)
## S4 method for signature 'MatrixGenotypeReader'
pcair(gdsobj, ...)
## S4 method for signature 'GenotypeData'
pcair(gdsobj, ...)
## S4 method for signature 'SeqVarGDSClass'
pcair(gdsobj, ...)
## S4 method for signature 'gds.class'
pcair(gdsobj, kinobj = NULL, divobj = NULL,
                            kin.thresh = 2^(-11/2), div.thresh = -2^(-11/2),
                            unrel.set = NULL, 
                            sample.include = NULL, snp.include = NULL,
                            num.cores = 1L, verbose = TRUE, ...)
## S4 method for signature 'SNPGDSFileClass'
pcair(gdsobj, ...)
## S4 method for signature 'GdsGenotypeReader'
pcair(gdsobj, ...)
## S4 method for signature 'MatrixGenotypeReader'
pcair(gdsobj, ...)
## S4 method for signature 'GenotypeData'
pcair(gdsobj, ...)
## S4 method for signature 'SeqVarGDSClass'
pcair(gdsobj, ...)

Arguments

`gdsobj`	An object providing a connection to a GDS file.
`kinobj`	A symmetric matrix of pairwise kinship coefficients for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be self-kinship or set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with `kin.thresh` and `unrel.set`. IDs for each individual must be set as the column names of the matrix. This matrix may also be provided as a GDS object; see 'Details'.
`divobj`	A symmetric matrix of pairwise ancestry divergence measures for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with `div.thresh`. IDs for each individual must be set as the column names of the matrix. This matrix may be identical to `kinobj`. This matrix may be `NULL` to ignore ancestry divergence. This matrix may also be provided as a GDS object; see 'Details'.
`kin.thresh`	Threshold value on `kinobj` used for declaring each pair of individuals as related or unrelated. The default value is 2^(-11/2) ~ 0.022, corresponding to 4th degree relatives. See 'Details' for how this interacts with `kinobj`.
`div.thresh`	Threshold value on `divobj` used for deciding if each pair of individuals is ancestrally divergent. The default value is -2^(-11/2) ~ -0.022. See 'Details' for how this interacts with `divobj`.
`unrel.set`	An optional vector of IDs for identifying individuals that are forced into the unrelated subset. See 'Details' for how this interacts with `kinobj`.
`sample.include`	An optional vector of IDs for selecting samples to consider for either set.
`snp.include`	An optional vector of snp or variant IDs to use in the analysis.
`num.cores`	The number of cores to use.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`...`	Additional arguments to pass to `snpgdsPCA`, such as `eigen.cnt` to control the number of PCs returned, `num.thread` for parallel execution, or `algorithm='randomized' for fastPCA (recommended for large sample sets).`

Details

The basic premise of PC-AiR is to partition the entire sample of individuals into an ancestry representative 'unrelated subset' and a 'related set', perform standard PCA on the 'unrelated subset', and predict PC values for the 'related subset'.

There are multiple ways to partition the sample into an ancestry representative 'unrelated subset' and a 'related subset'. In all of the scenarios described below, the set of all samples is limited to those in sample.include when it is specified (i.e. not NULL):

If kinobj is specified, divobj is specified, and unrel.set = NULL, then the PC-AiR algorithm is used to find an 'optimal' partition of all samples (see 'References' for a paper describing the PC-AiR algorithm).

If kinobj is specified, divobj is specified, and unrel.set is specified, then all individuals with IDs in unrel.set are forced in the 'unrelated subset' and the PC-AiR algorithm is used to partition the rest of the sample; this is especially useful for including reference samples of known ancestry in the 'unrelated subset'.

If kinobj is specified, and divobj = NULL, then kinobj is used to define the unrelated set but ancestry divergence is ignored.

If kinobj = NULL, and unrel.set is specified, then all individuals with IDs in unrel.set are put in the 'unrelated subset' and the rest of the individuals are put in the 'related subset'.

If kinobj = NULL, and unrel.set = NULL, then the function will perform a "standard" PCA analysis.

NOTE: kinobj and divobj may be identical.

All pcair methods take the same arguments, as they ultimately call the gds.class method. The MatrixGenotypeReader method is implemented by writing a temporary GDS file.

Value

An object of class 'pcair'. A list including:

`vectors`	A matrix of principal components; each column is a principal component. Sample IDs are provided as rownames. The number of PCs returned can be adjusted by supplying the `eigen.cnt` argument, which is passed to `snpgdsPCA`.
`values`	A vector of eigenvalues matching the principal components. These values are determined from the standard PCA run on the 'unrelated subset'.
`rels`	A vector of IDs for individuals in the 'related subset'.
`unrels`	A vector of IDs for individuals in the 'unrelated subset'.
`kin.thresh`	The threshold value used for declaring each pair of individuals as related or unrelated.
`div.thresh`	The threshold value used for determining if each pair of individuals is ancestrally divergent.
`sample.id`	A vector of IDs for the samples used in the analysis.
`nsamp`	The total number of samples in the analysis.
`nsnps`	The total number of SNPs used in the analysis.
`varprop`	The variance proportion for each principal component.
`call`	The function call passed to `pcair`.
`method`	A character string. Either "PC-AiR" or "Standard PCA" identifying which method was used for computing principal components. "Standard PCA" is used if no relatives were identified in the sample.

Author(s)

Matthew P. Conomos

References

Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

Examples

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)
gdsfmt::closefn.gds(HapMap_geno)
# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)
gdsfmt::closefn.gds(HapMap_geno)

Partition a sample into an ancestry representative 'unrelated subset' and a 'related subset'

Description

pcairPartition is used to partition a sample from a genetic study into an ancestry representative 'unrelated subset' and a 'related subset'. The 'unrelated subset' contains individuals who are all mutually unrelated to each other and representative of the ancestries of all individuals in the sample, and the 'related subset' contains individuals who are related to someone in the 'unrealted subset'.

Usage

pcairPartition(kinobj, divobj = NULL,
               kin.thresh = 2^(-11/2), div.thresh = -2^(-11/2),
               unrel.set = NULL, sample.include = NULL, verbose = TRUE)
pcairPartition(kinobj, divobj = NULL,
               kin.thresh = 2^(-11/2), div.thresh = -2^(-11/2),
               unrel.set = NULL, sample.include = NULL, verbose = TRUE)

Arguments

`kinobj`	A symmetric matrix of pairwise kinship coefficients for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be self-kinship or set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with `kin.thresh` and `unrel.set`. IDs for each individual must be set as the column names of the matrix. This matrix may also be provided as a GDS object; see 'Details'.
`divobj`	A symmetric matrix of pairwise ancestry divergence measures for every pair of individuals in the sample: upper and lower triangles must both be filled; diagonals should be set to a non-missing constant value. This matrix is used for partitioning the sample into the 'unrelated' and 'related' subsets. See 'Details' for how this interacts with `div.thresh`. IDs for each individual must be set as the column names of the matrix.This matrix may be identical to `kinobj`. This matrix may be `NULL` to ignore ancestry divergence. This matrix may also be provided as a GDS object; see 'Details'.
`kin.thresh`	Threshold value on `kinobj` used for declaring each pair of individuals as related or unrelated. The default value is 2^(-11/2) ~ 0.022, corresponding to 4th degree relatives. See 'Details' for how this interacts with `kinobj`.
`div.thresh`	Threshold value on `divobj` used for deciding if each pair of individuals is ancestrally divergent. The default value is -2^(-11/2) ~ -0.022. See 'Details' for how this interacts with `divobj`.
`unrel.set`	An optional vector of IDs for identifying individuals that are forced into the unrelated subset. See 'Details' for how this interacts with `kinobj`.
`sample.include`	An optional vector of IDs for selecting samples to consider for either set.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.

Details

We recommend using software that accounts for population structure to estimate pairwise kinship coefficients to be used in kinobj. Any pair of individuals with a pairwise kinship greater than kin.thresh will be declared 'related.' Kinship coefficient estimates from the KING-robust software are typically used as measures of ancestry divergence in divobj. Any pair of individuals with a pairwise divergence measure less than div.thresh will be declared ancestrally 'divergent'. Typically, kin.thresh and div.thresh are set to be the amount of error around 0 expected in the estimate for a pair of truly unrelated individuals. If unrel.set = NULL, the PC-AiR algorithm is used to find an 'optimal' partition (see 'References' for a paper describing the algorithm). If unrel.set and kinobj are both specified, then all individuals with IDs in unrel.set are forced in the 'unrelated subset' and the PC-AiR algorithm is used to partition the rest of the sample; this is especially useful for including reference samples of known ancestry in the 'unrelated subset'.

For large sample sizes, storing both kinobj and divobj in memory may be prohibitive. Both matrices may be stored in GDS files and provided as gds.class objects. mat2gds saves matrices in GDS format. Alternatively, kinobj (but not divobj) can be represented as a sparse Matrix object; see kingToMatrix and pcrelateToMatrix.

Matrix objects from the Matrix package are also supported.

Value

A list including:

`rels`	A vector of IDs for individuals in the 'related subset'.
`unrels`	A vector of IDs for individuals in the 'unrelated subset'.

Note

pcairPartition is called internally in the function pcair but may also be used on its own to partition the sample into an ancestry representative 'unrelated' subset and a 'related' subset without performing PCA.

Author(s)

Matthew P. Conomos

References

Manichaikul, A., Mychaleckyj, J.C., Rich, S.S., Daly, K., Sale, M., & Chen, W.M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867-2873.

Examples

# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# partition the sample
part <- pcairPartition(kinobj = HapMap_ASW_MXL_KINGmat, 
                       divobj = HapMap_ASW_MXL_KINGmat)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# partition the sample
part <- pcairPartition(kinobj = HapMap_ASW_MXL_KINGmat, 
                       divobj = HapMap_ASW_MXL_KINGmat)

PC-Relate: Model-Free Estimation of Recent Genetic Relatedness

Description

pcrelate is used to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients using genome-wide SNP data. PC-Relate accounts for population structure (ancestry) among sample individuals through the use of ancestry representative principal components (PCs) to provide accurate relatedness estimates due only to recent family (pedigree) structure.

Usage

## S4 method for signature 'GenotypeIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
## S4 method for signature 'SeqVarIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
samplesGdsOrder(gdsobj, sample.include)
calcISAFBeta(gdsobj,
        pcs,
        sample.include,
        training.set = NULL,
	    BPPARAM = bpparam(),
        verbose = TRUE)
pcrelateSampBlock(gdsobj,
        betaobj,
        pcs,
        sample.include.block1,
        sample.include.block2,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	BPPARAM = bpparam(),
	verbose = TRUE)
correctKin(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL)
correctK2(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL,
        small.samp.correct = TRUE)
correctK0(kinBtwn)
## S4 method for signature 'GenotypeIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
## S4 method for signature 'SeqVarIterator'
pcrelate(gdsobj,
	pcs,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	sample.include = NULL,
	training.set = NULL,
	sample.block.size = 5000,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	small.samp.correct = TRUE,
	BPPARAM = bpparam(),
	verbose = TRUE)
samplesGdsOrder(gdsobj, sample.include)
calcISAFBeta(gdsobj,
        pcs,
        sample.include,
        training.set = NULL,
	    BPPARAM = bpparam(),
        verbose = TRUE)
pcrelateSampBlock(gdsobj,
        betaobj,
        pcs,
        sample.include.block1,
        sample.include.block2,
	scale = c('overall', 'variant', 'none'),
	ibd.probs = TRUE,
	maf.thresh = 0.01,
	maf.bound.method = c('filter', 'truncate'),
	BPPARAM = bpparam(),
	verbose = TRUE)
correctKin(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL)
correctK2(kinBtwn, kinSelf,
        pcs,
        sample.include = NULL,
        small.samp.correct = TRUE)
correctK0(kinBtwn)

Arguments

`gdsobj`	An object of class `SeqVarIterator` from the package SeqVarTools, or an object of class `GenotypeIterator` from the package GWASTools, containing the genotype data for the variants and samples to be used for the analysis.
`pcs`	A matrix of principal components (PCs) to be used for ancestry adjustment. Each column represents a PC, and each row represents an individual. IDs for each individual must be set as the row names of the matrix.
`scale`	A character string taking the values 'overall', 'variant', or 'none' indicating how genotype values should be standardized. This should be set to 'overall' (the default) in order to do a PC-Relate analysis; see 'Details' for more information.
`ibd.probs`	Logical indicator of whether pairwise IBD sharing probabilities (k0, k1, k2) should be estimated; the default is TRUE.
`sample.include`	A vector of IDs for samples to include in the analysis. If NULL, all samples in `gdsobj` are included.
`training.set`	An optional vector of IDs identifying which samples to use for estimation of the ancestry effect when estimating individual-specific allele frequencies. If NULL, all samples in sample.include are used. See 'Details' for more information.
`sample.block.size`	The number of individuals to read-in/analyze at once; the default value is 5000. See 'Details' for more information.
`maf.thresh`	Minor allele frequency threshold; if an individual's estimated individual-specific minor allele frequency at a SNP is less than this value, that indivdiual will either have that SNP excluded from the analysis or have their estimated indivdiual-specific minor allele frequency truncated to this value, depending on `maf.bound.method`. The default value is 0.01.
`maf.bound.method`	How individual-specific minor allele frequency estimates less that `maf.thresh` are handled. When set to ‘filter' (default), SNPs for which an individual’s estimated individual-specific minor allele frequency are below `maf.thresh` are excluded from the analysis for that individual. When set to 'truncate', estimated individual-specific minor allele frequncies below `maf.thresh` have their value set to `maf.thresh`.
`small.samp.correct`	Logical indicator of whether to implement a small sample correction. The default is `TRUE`, but must be set to `FALSE` if `sample.block.size` is less than the number of samples or if `scale` is `'none'`.
`BPPARAM`	A `BiocParallelParam` object to process blocks of variants in parallel. If not provided, the default back-end returned by `bpparam` will be used.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.
`betaobj`	Outut of `calcISAFBeta`.
`sample.include.block1`	A vector of IDs for samples to include in block 1.
`sample.include.block2`	A vector of IDs for samples to include in block 2.
`kinBtwn`	Output of `pcrelateSampBlock`.
`kinSelf`	Output of `pcrelateSampBlock`.

Details

The basic premise of PC-Relate is to estimate kinship coefficients, IBD sharing probabilities, and inbreeding coefficients that reflect recent family (pedigree) relatedness by conditioning out genetic similarity due to distant population structure (ancestry) with ancestry representative principal components (PCs).

It is important that the PCs used in pcs to adjust for ancestry are representative of ancestry and NOT family structure, so we recommend using PCs calculated with PC-AiR (see: pcair).

pcrelate uses the BiocParallel package to process iterator chunks in parallel. See the BiocParallel documentation for more information on the default behaviour of bpparam and how to register different parallel backends. If serial execution is desired, set BPPARAM=BiocParallel::SerialParam(). Note that parallel execution requires more RAM than serial execution.

In order to perform relatedness estimation, allele frequency estimates are required for centering and scaling genotype values. Individual-specific allele frequencies calculated for each individual at each SNP using the PCs specified in pcs are used. There are muliple choices for how genotype values are scaled. When scale is 'variant', centered genotype values at each SNP are divided by their expected variance under Hardy-Weinberg equilibrium. When scale is 'overall', centered genotype values at all SNPs are divided by the average across all SNPs of their expected variances under Hardy-Weinberg equilibrium; this scaling leads to more stable behavior when using low frequency variants. When scale is 'none', genotype values are only centered and not scaled; this won't provide accurate kinship coefficient estimates but may be useful for other purposes. Set scale to 'overall' to perform a standard PC-Relate analysis; this is the default. If scale is set to 'variant', the estimators are very similar to REAP.

The optional input training.set allows the user to specify which samples are used to estimate the ancestry effect when estimating individual-specific allele frequencies. Ideally, training.set is a set of mutually unrelated individuals. If prior information regarding pedigree structure is available, this can be used to select training.set, or if pcair was used to obtain the PCs, then the individuals in the PC-AiR 'unrelated subset' can be used. If no prior information is available, all individuals should be used.

The sample.block.size can be specified to alleviate memory issues when working with very large data sets. If sample.block.size is smaller than the number of individuals included in the analysis, then individuals will be analyzed in separate blocks. This reduces the memory required for the analysis, but genotype data must be read in multiple times for each block (to analyze all pairs), which increases the number of computations required.

calcISAFBeta and pcrelateSampBlock are provided as separate functions to allow parallelization for large sample sizes. pcrelate calls both of these functions internally. When calling these functions separately, use samplesGdsOrder to ensure the sample.include argument is in the same order as the GDS file. Use correctKin, correctK2, and correctK0 after all sample blocks have been completed.

Value

An object of class 'pcrelate'. A list including:

`kinBtwn`	A data.frame of estimated pairwise kinship coefficients and IBD sharing probabilities (if `ibd.probs` is `TRUE`).
`kinSelf`	A data.frame of estimated inbreeding coefficients.

Author(s)

Matthew P. Conomos

References

Conomos M.P., Reiner A.P., Weir B.S., & Thornton T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148.

Examples

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat, 
                divobj = HapMap_ASW_MXL_KINGmat)

# create a GenotypeBlockIterator object
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData)
# run PC-Relate
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
				training.set = mypcair$unrels,
				BPPARAM=BiocParallel::SerialParam())
head(mypcrel$kinBwtn)
head(mypcrel$kinSelf)

grm <- pcrelateToMatrix(mypcrel)
dim(grm)

close(HapMap_genoData)
library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat, 
                divobj = HapMap_ASW_MXL_KINGmat)

# create a GenotypeBlockIterator object
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData)
# run PC-Relate
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
				training.set = mypcair$unrels,
				BPPARAM=BiocParallel::SerialParam())
head(mypcrel$kinBwtn)
head(mypcrel$kinSelf)

grm <- pcrelateToMatrix(mypcrel)
dim(grm)

close(HapMap_genoData)

Creates a Genetic Relationship Matrix (GRM) of Pairwise Kinship Coefficient Estimates from PC-Relate Output

Description

pcrelateToMatrix is used to create a genetic relationship matrix (GRM) of pairwise kinship coefficient estimates from the output of pcrelate.

Usage

## S4 method for signature 'pcrelate'
pcrelateToMatrix(pcrelobj, sample.include = NULL, thresh = NULL, scaleKin = 2, 
                 verbose = TRUE)
## S4 method for signature 'pcrelate'
pcrelateToMatrix(pcrelobj, sample.include = NULL, thresh = NULL, scaleKin = 2, 
                 verbose = TRUE)

Arguments

`pcrelobj`	The object containing the output from `pcrelate`. This should be a list of class `pcrelate` containing two data.frames; `kinSelf` with inbreeding coefficient estimates and `kinBtwn` with pairwise kinship coefficient estimates.
`sample.include`	A vector of IDs for samples to be included in the GRM. The default is NULL, which includes all samples in `pcrelobj`.
`thresh`	Kinship threshold for clustering samples to make the output matrix sparse block-diagonal. This thresholding is done after scaling kinship values by `scaleKin`. When `NULL`, no clustering is done. See 'Details'.
`scaleKin`	Specifies a numeric constant to scale each estimated kinship coefficient by in the GRM. The default value is 2.
`verbose`	Logical indicator of whether updates from the function should be printed to the console; the default is TRUE.

Details

This function provides a quick and easy way to construct a genetic relationship matrix (GRM) from the output of pcrelate.

thresh sets a threhsold for clustering samples such that any pair with an estimated kinship value greater than thresh is in the same cluster. All pairwise estimates within a cluster are kept, even if they are below thresh. All pairwise estimates between clusters are set to 0, creating a sparse, block-diagonal matrix. When thresh is NULL, no clustering is done and all samples are returned in one block. This feature may be useful for creating a sparse GRM when running association tests with very large sample sizes. Note that thresholding is done after scaling kinship values by scaleKin.

Value

An object of class 'Matrix' with pairwise kinship coefficients.

Author(s)

Matthew P. Conomos

References

Conomos M.P., Reiner A.P., Weir B.S., & Thornton T.A. (2016). Model-free Estimation of Recent Genetic Relatedness. American Journal of Human Genetics, 98(1), 127-148.

PC-AiR: Plotting PCs

Description

plot.pcair is used to plot pairs of principal components contained in a class 'pcair' object obtained as output from the pcair function.

Usage

## S3 method for class 'pcair'
plot(x, vx = 1, vy = 2, pch = NULL, col = NULL, 
        xlim = NULL, ylim = NULL, main = NULL, sub = NULL, 
        xlab = NULL, ylab = NULL, ...)
## S3 method for class 'pcair'
plot(x, vx = 1, vy = 2, pch = NULL, col = NULL, 
        xlim = NULL, ylim = NULL, main = NULL, sub = NULL, 
        xlab = NULL, ylab = NULL, ...)

Arguments

`x`	An object of class '`pcair`' obtained as output from the `pcair` function.
`vx`	An integer indicating which principal component to plot on the x-axis; the default is 1.
`vy`	An integer indicating which principal component to plot on the y-axis; the default is 2.
`pch`	Either an integer specifying a symbol or a single character to be used in plotting points. If `NULL`, the default is dots for the 'unrelated subset' and + for the 'related subset'.
`col`	A specification for the plotting color for points. If `NULL`, the default is black for the 'unrelated subset' and blue for the 'related subset'.
`xlim`	The range of values shown on the x-axis. If `NULL`, the default shows all points.
`ylim`	The range of values shown on the y-axis. If `NULL`, the default shows all points.
`main`	An overall title for the plot. If `NULL`, the default specifies which PC-AiR PCs are plotted.
`sub`	A sub title for the plot. If `NULL`, the default is none.
`xlab`	A title for the x-axis. If `NULL`, the default specifies which PC-AiR PC is plotted.
`ylab`	A title for the y-axis. If `NULL`, the default specifies which PC-AiR PC is plotted.
`...`	Other parameters to be passsed through to plotting functions, (see `par`).

Details

This function provides a quick and easy way to plot principal components obtained with the function pcair to visualize the population structure captured by PC-AiR.

Value

A figure showing the selected principal components plotted against each other.

Author(s)

Matthew P. Conomos

Examples

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)
# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)
gdsfmt::closefn.gds(HapMap_geno)
# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)
# plot top 2 PCs
plot(mypcair)
# plot PCs 3 and 4
plot(mypcair, vx = 3, vy = 4)
gdsfmt::closefn.gds(HapMap_geno)

PC-AiR: Principal Components Analysis in Related Samples

Description

Print methods for pcair

Usage

## S3 method for class 'pcair'
print(x, ...)
## S3 method for class 'pcair'
summary(object, ...)
## S3 method for class 'summary.pcair'
print(x, ...)
## S3 method for class 'pcair'
print(x, ...)
## S3 method for class 'pcair'
summary(object, ...)
## S3 method for class 'summary.pcair'
print(x, ...)

Arguments

`object`	An object of class '`pcair`', i.e. output from the `pcair` function.
`x`	An object of class '`pcair`', i.e. output from the `pcair` function.
`...`	Further arguments passed to or from other methods.

Author(s)

Matthew P. Conomos

Examples

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                divobj = HapMap_ASW_MXL_KINGmat)
print(mypcair)
summary(mypcair)
gdsfmt::closefn.gds(HapMap_geno)
# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- gdsfmt::openfn.gds(gdsfile)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")
# run PC-AiR
mypcair <- pcair(HapMap_geno, kinobj = HapMap_ASW_MXL_KINGmat, 
                divobj = HapMap_ASW_MXL_KINGmat)
print(mypcair)
summary(mypcair)
gdsfmt::closefn.gds(HapMap_geno)

Annotation for 1000 genomes Phase 3 samples

Description

Annotation for 1000 genomes Phase 3 samples included in the VCF files in "extdata/1KG".

Usage

    data(sample_annotation_1KG)
data(sample_annotation_1KG)

Format

A data.frame with columns:

sample.idSample identifier
PopulationPopulation of sample
sexSex of sample

Source

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp

References

A global reference for human genetic variation, The 1000 Genomes Project Consortium, Nature 526, 68-74 (01 October 2015) doi:10.1038/nature15393.

Variance Component Confidence Intervals

Description

varCompCI provides confidence intervals for the variance component estimates found using fitNullModel. The confidence intervals can be found on either the original scale or for the proportion of total variability explained.

Usage

varCompCI(null.model, prop = TRUE)
varCompCI(null.model, prop = TRUE)

Arguments

`null.model`	A null model object returned by `fitNullModel`.
`prop`	A logical indicator of whether the point estimates and confidence intervals should be returned as the proportion of total variability explained (TRUE) or on the orginal scale (FALSE).

Details

varCompCI takes the object returned by fitNullModel as its input and returns point estimates and confidence intervals for each of the random effects variance component estimates. If a kinship matrix or genetic relationship matrix (GRM) was included as a random effect in the model fit using fitNullModel, then this function can be used to provide a heritability estimate when prop is TRUE.

Value

varCompCI prints a table of point estimates and 95% confidence interval limits for each estimated variance component.

Author(s)

Matthew P. Conomos

Examples

library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
    			training.set = mypcair$unrels,
                        BPPARAM = BiocParallel::SerialParam())
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id, 
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)

# find the variance component CIs
varCompCI(nullmod, prop = TRUE)
varCompCI(nullmod, prop = FALSE)
library(GWASTools)

# file path to GDS file
gdsfile <- system.file("extdata", "HapMap_ASW_MXL_geno.gds", package="GENESIS")
# read in GDS data
HapMap_geno <- GdsGenotypeReader(filename = gdsfile)
# create a GenotypeData class object
HapMap_genoData <- GenotypeData(HapMap_geno)
# load saved matrix of KING-robust estimates
data("HapMap_ASW_MXL_KINGmat")

# run PC-AiR
mypcair <- pcair(HapMap_genoData, kinobj = HapMap_ASW_MXL_KINGmat, 
                 divobj = HapMap_ASW_MXL_KINGmat)

# run PC-Relate
HapMap_genoData <- GenotypeBlockIterator(HapMap_genoData, snpBlock=20000)
mypcrel <- pcrelate(HapMap_genoData, pcs = mypcair$vectors[,1,drop=FALSE],
    			training.set = mypcair$unrels,
                        BPPARAM = BiocParallel::SerialParam())
close(HapMap_genoData)

# generate a phenotype
set.seed(4)
pheno <- 0.2*mypcair$vectors[,1] + rnorm(mypcair$nsamp, mean = 0, sd = 1)

annot <- data.frame(sample.id = mypcair$sample.id, 
                    pc1 = mypcair$vectors[,1], pheno = pheno)

# make covariance matrix
cov.mat <- pcrelateToMatrix(mypcrel, verbose=FALSE)[annot$sample.id, annot$sample.id]

# fit the null mixed model
nullmod <- fitNullModel(annot, outcome = "pheno", covars = "pc1", cov.mat = cov.mat)

# find the variance component CIs
varCompCI(nullmod, prop = TRUE)
varCompCI(nullmod, prop = FALSE)

Package 'GENESIS'

Help Index

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness

Description

Details

Author(s)

References

admixMap

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Aggregate Association Testing

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Genotype Association Testing with Mixed Models

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Computes variant-specific inflation factors

Description

Usage

Arguments

Details

Value

Author(s)

References

Examples

Return the effect allele for association testing

Description

Usage

Arguments

Details

Value

Author(s)

See Also

Examples

Fit a Model Under the Null Hypothesis

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Defunct functions in package GENESIS

Description

Details

Matrix of Pairwise Kinship Coefficient Estimates for the combined HapMap ASW and MXL Sample found with the KING-robust estimator from the KING software.

Description

Usage

Format

Value

Source

References

Perform a joint score test

Description

Usage

Arguments

Details

Value