Title: | SnpMatrix and XSnpMatrix classes and methods |
---|---|
Description: | Classes and statistical methods for large SNP association studies. This extends the earlier snpMatrix package, allowing for uncertainty in genotypes. |
Authors: | David Clayton <[email protected]> |
Maintainer: | David Clayton <[email protected]> |
License: | GPL-3 |
Version: | 1.57.0 |
Built: | 2025-01-17 04:53:08 UTC |
Source: | https://github.com/bioc/snpStats |
Classes and statistical methods for large SNP association studies. This extends the earlier snpMatrix package, allowing for uncertainty in genotypes.
The DESCRIPTION file:
Package: | snpStats |
Title: | SnpMatrix and XSnpMatrix classes and methods |
Version: | 1.57.0 |
Date: | 2022-07-27 |
Author: | David Clayton <[email protected]> |
Description: | Classes and statistical methods for large SNP association studies. This extends the earlier snpMatrix package, allowing for uncertainty in genotypes. |
Maintainer: | David Clayton <[email protected]> |
Depends: | R(>= 2.10.0), survival, Matrix, methods |
Imports: | graphics, grDevices, stats, utils, BiocGenerics, zlibbioc |
Suggests: | hexbin |
License: | GPL-3 |
Collate: | ss.R contingency.table.R convert.R compare.R glm-test.R imputation.R indata.R long.R misc.R ld.R mvtests.R pedfile.R outdata.R plink.R qc.R qq-chisq.R single.R tdt-single.R structure.R xstuff.R |
LazyLoad: | yes |
biocViews: | Microarray, SNP, GeneticVariability |
Packaged: | 2022-07-27 |
Repository: | https://bioc.r-universe.dev |
RemoteUrl: | https://github.com/bioc/snpStats |
RemoteRef: | HEAD |
RemoteSha: | a3f434d515c2d14f822c41e0fdfdc654d159b6c2 |
Index of help topics:
Fst Calculate fixation indices GlmEstimates-class Class "GlmEstimates" GlmTests-class Classes "GlmTests" and "GlmTestsScore" ImputationRules-class Class "ImputationRules" SingleSnpTests-class Classes "SingleSnpTests" and "SingleSnpTestsScore" SnpMatrix-class Class "SnpMatrix" XSnpMatrix-class Class "XSnpMatrix" chi.squared Extract test statistics and p-values convert.snpMatrix Convert 'snpMatrix' objects to 'snpStats' objects example-new An example of intensity data for SNP genotyping families Test data for family association tests filter.rules Filter a set of imputation rules for.exercise Data for exercise in use of the snpStats package glm.test.control Set up control object for GLM computations ibsCount Count alleles identical by state ibsDist Distance matrix based on identity by state (IBS) imputation.maf Extract statistics from imputation rules impute.snps Impute snps ld Pairwise linkage disequilibrium measures ld.example Datasets to illustrate calculation of linkage disequilibrium statistics mean2g Raw coding of posterior probabilities of SNP genotype misinherits Find non-Mendelian inheritances in family data mvtests Multivariate SNP tests plotUncertainty Plot posterior probabilities of genotype assignment pool Pool test results from several studies or sub-studies pool2 Pool results of tests from two independent datasets pp Unpack posterior probabilities from one-byte codes qq.chisq Quantile-quantile plot for chi-squared tests random.snps Generate random SnpMatrix read.beagle Read genotypes imputed by the BEAGLE program read.impute Read genotypes imputed by the IMPUTE2 program read.long Read SNP genotype data in long format read.mach Read genotypes imputed by the MACH program read.pedfile Read a pedfile as '"SnpMatrix"' object read.plink Read a PLINK binary data file as a SnpMatrix read.snps.long Read SNP data in long format (deprecated) row.summary Summarize rows or columns of a snp matrix sample.info Sample datasets to illustrate data input single.snp.tests 1-df and 2-df tests for genetic associations with SNPs (or imputed SNPs) sm.compare Compare two SnpMatrix objects snp.cor Correlations with columns of a SnpMatrix snp.imputation Calculate imputation rules snp.lhs.estimates Logistic regression with SNP genotypes as dependent variable snp.lhs.tests Score tests with SNP genotypes as dependent variable snp.pre.multiply Pre- or post-multiply a SnpMatrix object by a general matrix snp.rhs.estimates Fit GLMs with SNP genotypes as independent variable(s) snp.rhs.tests Score tests with SNP genotypes as independent variable snpStats-package SnpMatrix and XSnpMatrix classes and methods switch.alleles Switch alleles in columns of a SnpMatrix or in test results tdt.snp 1-df and 2-df tests for genetic associations with SNPs (or imputed SNPs) in family data test.allele.switch Test for switch of alleles between two collections testdata Test data for the snpStats package write.SnpMatrix Write a SnpMatrix object as a text file write.plink Write files for analysis in the PLINK toolset xxt X.X-transpose for a standardized SnpMatrix
David Clayton <[email protected]>
Maintainer: David Clayton <[email protected]>
Generic functions to extract values from the SNP association test objects returned by various testing functions
chi.squared(x, df) deg.freedom(x) effect.sign(x, simplify) p.value(x, df) sample.size(x) effective.sample.size(x)
chi.squared(x, df) deg.freedom(x) effect.sign(x, simplify) p.value(x, df) sample.size(x) effective.sample.size(x)
x |
An object of class |
df |
Either the numeric value 1 or 2 (not used when |
simplify |
This switch is relevant when |
These functions operate on objects created by
single.snp.tests
, snp.lhs.tests
, and
snp.lhs.tests
.
The functions chi.squared
and p.value
return the
chi-squared statistic and the corresponding p-value.
The argument df
is only used for
output from single.snp.tests
, since this function calculates
both 1 df and 2 df tests for each SNP. The functions
snp.lhs.tests
and snp.rhs.tests
potentially calculate
chi-squared tests on varying degrees of freedom, which can be
extracted with deg.freedom
. The function effect.sign
indicates the direction of
associations. When applied to an output object from
snp.single.tests
, it returns +1
if the association, as
measured by the 1 df test, is positive and -1
if the
association is negative. Each test calculated by GlmTests
are potentially tests of several parameters so that the effect sign
can be a vector. Thus effect.sign
returns a list of sign
vectors unless, if simplify=TRUE
, and it can be simplified as a
single vector with one sign for each test.
The function sample.size
returns the number of observations
actually used in the test, after exclusions due to missing data have
been applied, and effective.sample.size
returns the effective
sample size which is less than the true sample size for tests on
imperfectly imputed SNPs.
A numeric vector containing the
chi-squared test statistics or p-values. The output vector has a names
attribute.
The df
and simplify
arguments are not always
required (or legal). See above
David Clayton [email protected]
single.snp.tests
,
snp.lhs.tests
, snp.rhs.tests
,
SingleSnpTests-class
,
SingleSnpTestsScore-class
,
GlmTests-class
data(testdata) tests <- single.snp.tests(cc, stratum=region, data=subject.data, snp.data=Autosomes, snp.subset=1:10) chi.squared(tests, 1) p.value(tests, 1)
data(testdata) tests <- single.snp.tests(cc, stratum=region, data=subject.data, snp.data=Autosomes, snp.subset=1:10) chi.squared(tests, 1) p.value(tests, 1)
snpMatrix
objects to snpStats
objects
These functions convert snpMatrix
objects to snpStats
objects. convert.snpMatrix
converts a single object, while
convert.snpMatrix.dir
converts all stored elements in a specified
directory. They really only change the class names since most of the
classes in snpStats
are backwards-compatible with
snpMatrix
. The exception is the ImputationRules
class;
imputation.rules
objects will need to be regenerated.
convert.snpMatrix(object) convert.snpMatrix.dir(dir = ".", ext = "RData")
convert.snpMatrix(object) convert.snpMatrix.dir(dir = ".", ext = "RData")
object |
Object to be converted |
dir |
A directory containing saved |
ext |
The file extension for files containing such objects |
convert.snpMatrix
returns the converted object.
convert.snpMatrix.dir
rewrites the files in place.
David Clayton [email protected]
The file example-new.txt
contains some signal intensity data for
testing and comparing genotype scoring algorithms
This is a text file containing data on 99 SNPs for 1550 DNA
samples. One line of data appears for each SNP, starting with the
SNP name and followed by 1550 pairs of intensity values. There is a
header line containing variable names, with intensities labelled
as xxxxA
and xxxxB
, where xxxx
is the sample
name.
See the package vignette "Comparing clustering algorithms".
These data were originally distributed with the "Illuminus" genotype scoring software from the Wellcome Trust Sanger Institute: http://www.sanger.ac.uk/resources/software/illuminus/
These data started life as real data derived from an affected sibling pair study of type 1 diabetes. However, original subject and SNP identidiers have been replaced by randomly chosen ones.
data(families)
data(families)
There are two objects in the loaded data file:
genotypes
: An object of class
"snp.matrix"
containing the SNP
genotype data for both parents and affected offspring
pedData
: A data frame containing the standard six fields
for a LINKAGE pedfile. The are named familyid
,
member
, father
, mother
sex
, and
affected
The two objects are linked by common row names.
Coding in the pedData
frame is as in the LINKAGE package,
except that missing data are coded NA
rather than zero
data(families) summary(genotypes) summary(pedData)
data(families) summary(genotypes) summary(pedData)
Determine which imputation rules are broken by removal of some SNPs from a study. This function is needed because, when if it emerges that genotyping of some SNPs is not reliable, necessitating their removal from study, we would also wish to remove any SNPs imputed on the basis of these unreliable SNPs.
filter.rules(rules, snps.excluded, exclusions = TRUE)
filter.rules(rules, snps.excluded, exclusions = TRUE)
rules |
An object of class |
snps.excluded |
The names of the SNPs whose removal is to be investigated |
exclusions |
If |
A character vector containing the names of imputed SNPs to be removed
David Clayton [email protected]
ImputationRules-class
,
snp.imputation
# No example yet
# No example yet
These data have been created artificially from publicly available datasets. The SNPs have been selected from those genotyped by the International HapMap Project (http://www.hapmap.org) to represent the typical density found on a whole genome association chip, (the Affymetrix 500K platform, http://www.affymetrix.com/support/technical/sample\_data/500k\_hapmap\_genotype\_data.affx for a moderately sized chromosome (chromosome 10). A study of 500 cases and 500 controls has been simulated allowing for recombination using beta software from Su and Marchini (http://www.stats.ox.ac.uk/~marchini/software/gwas/hapgen.html). Re-sampling of cases was weighted in such a way as to simulate three “causal” locus on this chromosome, with multiplicative effects of 1.3, 1.4 and 1.5 for each copy of the risk allele.
data(for.exercise)
data(for.exercise)
There are three data objects in the dataset:
snps.10
:
An object of class "SnpMatrix
"
containing a matrix of SNP genotype calls. Rows of the matrix
correspond to subjects and columns correspond to SNPs.
snp.support
:
A conventional R
data frame containing information about the
SNPs typed (the chromosome position and the nucleotides
corresponding to the two alleles of the SNP).
subject.support
:
A conventional R dataframe containing information about the study
subjects. There are two variables; cc
gives case/control
status (1=case), and stratum
gives ethnicity.
The data were obtained from the diabetes and inflammation laboratory (see http://www-gene.cimr.cam.ac.uk)
data(for.exercise) snps.10 summary(snps.10) summary(snp.support) summary(subject.support)
data(for.exercise) snps.10 summary(snps.10) summary(snp.support) summary(subject.support)
This function calculates the fixation index Fst for each SNP, together with its weight in the overall estimate (as used by the Internation HapMap Consortium).
Fst(snps, group, pairwise=FALSE)
Fst(snps, group, pairwise=FALSE)
snps |
an object of class |
group |
a factor (or object than can be coerced into a factor),
of length equal to the number of rows of |
pairwise |
if |
See vignette.
A list:
Fst values for each SNP
The weights for combining these into a single index
Uncertain genotypes are treated as missing
David Clayton [email protected]
## Analysis of some HapMap data data(for.exercise) f <- Fst(snps.10, subject.support$stratum) weighted.mean(f$Fst, f$weight)
## Analysis of some HapMap data data(for.exercise) f <- Fst(snps.10, subject.support$stratum) weighted.mean(f$Fst, f$weight)
Several commands depend on fitting a generalized linear model (GLM), using the standard iteratively reweighted least squares (IRLS) algorithm. This function sets various control parameters for this.
glm.test.control(maxit = 20, epsilon = 1.e-5, R2Max = 0.99)
glm.test.control(maxit = 20, epsilon = 1.e-5, R2Max = 0.99)
maxit |
Maximum number of IRLS steps |
epsilon |
Convergence threshold for IRLS algorithm |
R2Max |
R-squared limit for aliasing of new terms |
Sometimes (although not always), an iterative scheme is necessary to fit
a generalized linear model (GLM). The maxit
parameter sets
the maximum number of iterations to be carried out, while the
epsilon
parameter sets the criterion for determining
convergence. Variables which are judged to be "aliased" are dropped.
A variable is judged to be aliased if RSS/TSS is less than (1-R2Max),
where
RSS is the residual (weighted) sum of squares from the regression of that variable on the variables which precede it in the model formula (and any stratification defined in a strata() call in th emodel formula), and
TSS is the total (weighted) sum of squared deviations of this variable from its mean (or, when a strata() call is present, from its stratum-specific means).
The weights used in this calculation are the "working" weights of the IRLS algorithm.
Returns the parameters as a list in the expected order
David Clayton [email protected]
A simple class to hold output from
snp.lhs.estimates
and
snp.rhs.estimates
. Its main purpose is to provide a
show
method
Objects from this class are simple lists. Each element of the list is a list giving the results of a generalized linear model fit, with elements:
Name of the Y variable
The vector or parameter estimates (with their names)
The upper triangle of the variance-covariance matrix of estimates, stored as a simple vector
The number of "units" used in the model fit
Class "list"
, from data part.
Class "vector"
, by class "list", distance 2.
signature(x = "GlmEstimates", i = "ANY", j = "missing", drop = "missing")
: Subset
signature(from = "GlmEstimates", to =
"GlmTests")
: Calculate Wald tests
signature(object = "GlmEstimates")
: Display
Wald tests calculated by coercing an object of class
GlmEstimates
to class GlmTests
are asymptotically
equivalent to the "score" tests calculated by
snp.lhs.tests
and snp.rhs.tests
, and both types
of test are asymptotically correct. But the asymptotic properties of
Wald tests are not as good as those of score tests and, in
circumstances such as low sample size or low minor allele frequency,
Wald and score tests may differ substantially. In general score tests
are to be preferred, but Wald tests are provided in snpStats
because they are widely used.
David Clayton [email protected]
snp.lhs.estimates
,
snp.rhs.estimates
showClass("GlmEstimates")
showClass("GlmEstimates")
Classes of objects created by snp.lhs.tests
and snp.rhs.tests
. The class
"GlmTestsScore"
extends the class
"GlmTests"
and is invoked by setting the argument
score=TRUE
when calling testing functions
in order to save the scores and their variances
(and covariances)
Objects of class "GlmTests"
have four slots:
When only single SNPs are tested, a character vector of SNP names. Otherwise a list of such vectors (one for each test)
A character vector containing names of variables tested against SNPs
A numerical vector of chi-squared test values
An integer vector of degrees of freedom for the tests
A integer vector of the number of samples contributing to each test
The "GlmTestsScore"
class extends this, adding a slot
score
containing a list with elements which are themselves
lists with two elements:
The vector (or matrix) of efficient scores
The upper triangle of the variance-covariance matrix of
U
, stored as a vector
signature(x = "GlmTests", i = "ANY", j =
"missing", drop = "missing")
: Subsetting operator
signature(from = "GlmTests", to = "data.frame")
:
Simplify object
signature(x = "GlmTests", df =
"missing")
: Extract chi-squared test values
signature(x = "GlmTests")
: Extract
degrees of freedom for tests
signature(x="GlmTests")
: Extract (or generate)
a name for each test
signature(x = "GlmTests", df =
"missing")
: Extract p-values
signature(object = "GlmTests")
:
Extract sample sizes for tests
signature(object = "GlmTests")
: Show method
signature(object = "GlmTests")
: Summary
method
signature(x = "GlmTestsScore", i = "ANY", j =
"missing", drop = "missing")
: Subsetting operator
signature(x = "GlmTestsScore", simplify =
"logical")
: Extract signs of associations. If simpify
is
TRUE
then a simple vector is returned if all tests are on 1df
signature(x = "GlmTestsScore", y =
"GlmTestsScore", score = "logical")
: Combine results from two
sets of tests
signature(x = "GlmTestsScore", snps =
"character")
: Emulate, in the score vector and its (co)variances,
the effect of switching of the alleles of specified SNPs
Most of the methods for this class are shared with
the SingleSnpTests
and
SingleSnpTestsScore
classes
David Clayton [email protected]
snp.lhs.tests
,snp.rhs.tests
,
SingleSnpTests
,
SingleSnpTestsScore
showClass("GlmTests")
showClass("GlmTests")
This function counts, for all pairs of subjects and across all SNPs, the total number of alleles which are identical by state (IBS)
ibsCount(snps, uncertain = FALSE)
ibsCount(snps, uncertain = FALSE)
snps |
An input object of class |
uncertain |
If |
For each pair of subjects the function counts the total number of alleles which are IBS. For autosomal SNPs, each locus contributes 4 comparisons, since each subject carries two copies. For SNPs on the X chromosome, the number of comparisons is also 4 for female:female comparisons, but is 2 for female:male and 1 for male:male comparisons.
If there are N rows in the input matrix, the function returns an N*N matrix. The lower triangle contains the total number of comparisons and the upper triangle contains the number of these which are IBS. The diagonal contains the number of valid calls for each subject.
In genome-wide studies, the SNP data will usually be held as a series of
objects (of
class "SnpMatrix"
or "XSnpMatrix"
), one
per chromosome. Note that the matrices
produced by applying the ibsCount
function to each object in turn
can be added to yield the genome-wide result.
David Clayton [email protected]
ibsDist
which calculates a distance matrix based
on proportion of alleles which are IBS
data(testdata) ibs.A <- ibsCount(Autosomes[,1:100]) ibs.X <- ibsCount(Xchromosome)
data(testdata) ibs.A <- ibsCount(Autosomes[,1:100]) ibs.X <- ibsCount(Xchromosome)
Expresses a matrix of IBS counts (see ibsCount
) as a
distance matrix. The distance between two samples is returned as the
proportion of allele comparisons which are not IBS.
ibsDist(counts)
ibsDist(counts)
counts |
A matrix of IBS counts as produced by the function
|
An object of class "dist"
(see dist
)
David Clayton [email protected]
data(testdata) ibs <- ibsCount(Xchromosome) distance <- ibsDist(ibs)
data(testdata) ibs <- ibsCount(Xchromosome) distance <- ibsDist(ibs)
These functions extract key characteristics of regression-based
imputation rules stored as an object of class
"ImputationRules"
. imputation.maf
extracts the minor
allele frequencies of the imputed SNPs and imputation.r2
extracts
the prediction .
can.impute(rules) imputation.maf(rules) imputation.r2(rules) imputation.nsnp(rules)
can.impute(rules) imputation.maf(rules) imputation.r2(rules) imputation.nsnp(rules)
rules |
An object of class |
can.impute
returns a logical vector identifying which
rules allow a valid imputation.
imputation.maf
and imputation.r2
extract the
minor allele frequencies of the imputed SNPs and the for
prediction achieved when building each rule.
imputation.nsnp
returns the numbers of SNPs used in each imputation
Either a logical vector, or a numeric vector containing the extracted values
David Clayton [email protected]
ImputationRules-class
,
snp.imputation
# These functions are currently defined as function (rules) sapply(rules, function(x) x$maf) function (rules) sapply(rules, function(x) x$r2)
# These functions are currently defined as function (rules) sapply(rules, function(x) x$maf) function (rules) sapply(rules, function(x) x$r2)
A class defining a list "rules" for imputation of SNPs. Rules are estimated population haplotype probabilities for a target SNP and one or more predictor SNPs
Objects are lists of rules. Rules are named list elements each describing imputation of a SNP by a linear regression equation. Each element is itself a list with the following elements:
The minor allele frequency of the imputed SNP
The squared Pearson correlation coefficient between observed and predicted SNP duration derivation of the rule.
The names of the SNPs to be included in the regression.
A numerical array containing estimated probabilities for haplotypes of the SNP to be imputed and all the predictor SNPs
If any target SNP is monomorphic, the corresponding rule is
returned as NULL
. An object of class ImputationRules
has an attribute, Max.predictors
, which gives the maximum
number of predictors used for any imputation.
signature(object = "ImputationRules")
: prints
an abreviated listing of the rules
signature(object = "ImputationRules")
:
returns a table which shows the distribution of r-squared values
achieved against the number of snps used for imputation
signature(x="ImputationRules", y="missing")
:
plots the distribution of r-squared values as a stacked bar chart
signature(x = "ImputationRules", i = "ANY")
: subset
operations
David Clayton [email protected]
snp.imputation
, impute.snps
,
single.snp.tests
showClass("ImputationRules")
showClass("ImputationRules")
Given SNPs stored in an object of class "SnpMatrix"
or
"XSnpMatrix"
and a set of imputation rules in an object of
class "ImputationRules"
, this function calculates imputed values.
impute.snps(rules, snps, subset = NULL, as.numeric = TRUE)
impute.snps(rules, snps, subset = NULL, as.numeric = TRUE)
rules |
The imputation rules; an object of class
|
snps |
The object of class |
subset |
A vector describing the subset of subjects
to be used. If |
as.numeric |
If |
A matrix with imputed SNPs as columns. The imputed values are the estimated expected values of each SNP when coded 0, 1 or 2.
David Clayton [email protected]
Wallace, C. et al. (2010) Nature Genetics, 42:68-71
# Remove 5 SNPs from a datset and derive imputation rules for them data(for.exercise) sel <- c(20, 1000, 2000, 3000, 5000) to.impute <- snps.10[,sel] impute.from <- snps.10[,-sel] pos.to <- snp.support$position[sel] pos.fr <- snp.support$position[-sel] imp <- snp.imputation(impute.from, to.impute, pos.fr, pos.to) # Now calculate the imputed values imputed <- impute.snps(imp, impute.from)
# Remove 5 SNPs from a datset and derive imputation rules for them data(for.exercise) sel <- c(20, 1000, 2000, 3000, 5000) to.impute <- snps.10[,sel] impute.from <- snps.10[,-sel] pos.to <- snp.support$position[sel] pos.fr <- snp.support$position[-sel] imp <- snp.imputation(impute.from, to.impute, pos.fr, pos.to) # Now calculate the imputed values imputed <- impute.snps(imp, impute.from)
This function calculates measures of linkage disequilibrium between
pairs of SNPs. The two SNPs in each pair may both come from the same
SnpMatrix
object, or from two different SnpMatrix
objects. Statistics which can be calculated are the
log likelihood ratio, odds ratio, Yule's Q, covariance, D-prime,
R-squared, and R.
ld(x, y = NULL, depth = NULL, stats, symmetric = FALSE)
ld(x, y = NULL, depth = NULL, stats, symmetric = FALSE)
x |
An object of class |
y |
(Optional) Another object of the same class as |
depth |
When |
stats |
A character vector specifying the linkage disequilibrium measures to
be calculated. This should contain one or more of the strings: |
symmetric |
When no |
For each pair of SNPs, phased haplotype frequencies are first estimated
by maximum likelihood using the method described by Clayton and Leung
(2007). The arrays of chosen LD statistics are then calculated and
returned, either as
band matrices (when y
is not supplied), or as conventional
rectangular matrices (when y
is supplied). Band matrices are
stored in compressed form as objects of class dsCMatrix
(symmetric) or dgCMatrix
(upper triangular). These classes are
defined in the "Matrix"
package)
If only one LD statistic is requested, the function returns either a matrix or a compressed band matrix. If more than one LD statistic is requested, a list of such objects is returned
David Clayton [email protected]
Clayton and Leung (2007) Human Heredity, 64:45-51, (this paper is included in package documentation)
data(testdata) ld1 <- ld(Autosomes[, 1:50], depth=10, stats=c("D.prime", "R.squared")) ld2 <- ld(Autosomes[, 1:20], Autosomes[, 21:25], stats="R.squared")
data(testdata) ld1 <- ld(Autosomes[, 1:50], depth=10, stats=c("D.prime", "R.squared")) ld2 <- ld(Autosomes[, 1:20], Autosomes[, 21:25], stats="R.squared")
This R data file contains data from the International HapMap project, concerning 603 SNPs spanning a one megabase region on chromosome 22, in a sample of Europeans and a sample of Africans
There are three objects in the file:
ceph.1m
: A snpMatrix
object containing the
European genotype data
yri.1m
: A snpMatrix
object containing the
African genotype data
support.ld
: A dataframe containing details (chromosome
position etc.
of the 603 SNPs
http://hapmap.ncbi.nlm.nih.gov
The International HapMap Consortium. The International HapMap Project. Nature 426:789-796 (2003)
An uncertain SNP genotype call is represented by the three posterior
probabilities of the three possible calls. In the class
SnpMatrix
, an approximation to these is packed into a single
1-byte variable of type raw
. These functions carry out this coding (and
decoding).
post2g(p, transpose = FALSE) mean2g(m, maxE = FALSE) g2post(g, transpose = FALSE)
post2g(p, transpose = FALSE) mean2g(m, maxE = FALSE) g2post(g, transpose = FALSE)
p |
A matrix of posterior probabilities. If |
m |
A vector of posterior means |
g |
A |
transpose |
A logical flag indicating transposition of the matric of posterior probabilities (see Description) |
maxE |
A logical flag selecting the maximum entropy option in
|
post2g
and g2post
convert from posterior probabilities to
raw
code and back respectively. If only the posterior expectation
of the genotype (when numerically coded 0, 1, or 2) is available, no
unique soultion exists in general and the behaviour of the function is
determined by the value of maxE
. If TRUE
, then the maximum
entropy solutions are returned while, if FALSE
, an attempt is made
to return the least uncertain solution, by setting the posterior
probability of the BB
genotype to zero when the posterior mean is
less than 1.0 and the probability of AA
to zero when the mean is
greater than 1.0.
post2g
and mean2g
return a vector of type
raw
. g2post
returns a numeric matrix.
These functions are provided mainly for users wishing to write their own data input functions.
David Clayton [email protected]
data(testdata) g <- Autosomes[1:10, 20] ## A vector of codes p <- g2post(g) ## Transform to probabilities ... pg <- post2g(p) ## ... and back to codes m <- p[,2]+2*p[,3] ## Posterior expectations mg <- mean2g(m) ## Transform to codes ... pmg <- g2post(mg) ## ... and transform to probabilities ## Write everything out print(cbind(as(g, "numeric"), p, as.numeric(pg), m, as.numeric(mg), pmg))
data(testdata) g <- Autosomes[1:10, 20] ## A vector of codes p <- g2post(g) ## Transform to probabilities ... pg <- post2g(p) ## ... and back to codes m <- p[,2]+2*p[,3] ## Posterior expectations mg <- mean2g(m) ## Transform to codes ... pmg <- g2post(mg) ## ... and transform to probabilities ## Write everything out print(cbind(as(g, "numeric"), p, as.numeric(pg), m, as.numeric(mg), pmg))
For SNP data in families, this function locates all subjects whose parents are in the dataset and tests each SNP for non-Mendelian inheritances in these trios.
misinherits(ped, id, father, mother, data = sys.parent(), snp.data)
misinherits(ped, id, father, mother, data = sys.parent(), snp.data)
ped |
Pedigree identifiers |
id |
Subject identifiers |
father |
Identifiers for subjects' fathers |
mother |
Identifiers for subjects' mothers |
data |
A data frame in which to evaluate the previous four arguments |
snp.data |
An object of class |
The first four arguments are usually derived from a "pedfile". If a
data frame is supplied for the data
argument, the first four
arguments will be evaluated in this frame. Otherwise they will be evaluated
in the calling environment. If the arguments are missing, they will be
assumed to be in their usual positions in the pedfile data frame
i.e. in columns one to four. If the pedfile data are obtained from
a dataframe, the row names of the data
and snp.data
files will be used to align the pedfile and SNP data. Otherwise, these
vectors will be assumed to be in the same order as the rows of
snp.data
.
A logical matrix. Rows are subjects with any non-Mendelian inheritances and columns are SNPs with any non-Mendelian inheritances. The body of the matrix details whether each subject has non-Mendelian inheritance at each SNP. If a subject has no recorded genotype for a specific SNP, the corresponding element of the output matrix is set to NA.
David Clayton [email protected]
data(families) misinherits(data=pedData, snp.data=genotypes)
data(families) misinherits(data=pedData, snp.data=genotypes)
This function calculates multivariate score tests between a multivariate (or multinomial) phenotype and sets of SNPs
mvtests(phenotype, sets, stratum, data = sys.parent(), snp.data, rules = NULL, complete = TRUE, uncertain = FALSE, score = FALSE)
mvtests(phenotype, sets, stratum, data = sys.parent(), snp.data, rules = NULL, complete = TRUE, uncertain = FALSE, score = FALSE)
phenotype |
Either a factor (for a multinomial phenotype) or a matrix (for a multivariate phenotype) |
sets |
A list of sets of SNPs to be tested against the phenotype |
stratum |
(Optional) a stratifying variable |
data |
A data frame in which |
snp.data |
An object of class |
rules |
(Optional) A set of imputation rules. The function then carries out tess on imputed SNPs |
complete |
If |
uncertain |
If |
score |
If |
Currently complete=FALSE
is not implemented
An object of class snp.tests.glm
or GlmTests.score
depending on whether score
is set to FALSE
or TRUE
in the call
This is an experimental version
David Clayton [email protected]
## No example yet
## No example yet
The snpStats package allows for storage of uncertain genotype assignments in a one byte "raw" variable. The probabilities of assignment form a three-vector, subject to the linear constraint that they sum to 1.0; their possible values are grouped into 253 different classes. This function displays counts of these classes on a two-dimensional isometric plot.
plotUncertainty(snp, nlevels = 10, color.palette = heat.colors(nlevels))
plotUncertainty(snp, nlevels = 10, color.palette = heat.colors(nlevels))
snp |
One or more columns of a |
nlevels |
Probability cells are coloured according to frequency. This argument gives the number of colours that can be used |
color.palette |
The colour palette to be used |
The plot takes the form of an equilateral triangle in which each apex represents a certain assignment to one of the three genotypes. A point within the triangle represents, by the perpendicular distance from each side, the three probabilities. Each of the 253 probability classes is represented by a hexagonal cell, coloured according to its frequency in the data, which is also written within the cell
David Clayton [email protected]
## No example available yet
## No example available yet
Given the same set of "score" tests carried out in several studies or
in several different sub-samples within a study, this function pools
the evidence by summation of the score statistics and score
variances. It combines tests produced by
single.snp.tests
or by snp.lhs.tests
and
snp.rhs.tests
.
pool(..., score = FALSE)
pool(..., score = FALSE)
... |
Objects holding the (extended) test results.
These must be of class
|
score |
Is extended score information to
be returned in the output object? Relevant only for
|
This function works by recursive calls to the generic function
pool2
which pools the results of two studies.
An object of same class as the input objects (optionally without the
.score
) extension. Tests are produced for the union of
SNPs tested in all the input objects.
David Clayton [email protected]
pool2
,
SingleSnpTestsScore-class
,
GlmTests-class
,
single.snp.tests
,
snp.lhs.tests
,
snp.rhs.tests
# An artificial example which simply doubles the size of a study data(testdata) sst <- single.snp.tests(snp.data=Autosomes, cc, data=subject.data, score=TRUE) sst2 <- pool(sst, sst) summary(sst2)
# An artificial example which simply doubles the size of a study data(testdata) sst <- single.snp.tests(snp.data=Autosomes, cc, data=subject.data, score=TRUE) sst2 <- pool(sst, sst) summary(sst2)
Generic function to pool results of tests from two independent
datasets. It is not designed to be called directly, but is called
recursively by pool
pool2(x, y, score)
pool2(x, y, score)
x , y
|
Objects holding the (extended) test results.
These must be of class
|
score |
Is extended score information to be returned in the output object? |
An object of same class as the input objects (optionally without the
.score
) extension. Tests are produced for the union of
SNPs tested in all the input objects.
David Clayton [email protected]
pool
,
SingleSnpTestsScore-class
,
GlmTests-class
,
single.snp.tests
,
snp.lhs.tests
,
snp.rhs.tests
In snpStats
, the three "posterior" probabilities corresponding to
the possible values of an uncertain genotype are packed into a single
byte code (with, of course, some loss in accuracy). This function, which
is provided as an aid to writing new functions, unpacks the posterior
probabilities from the single byte codes.
pp(x, transpose = FALSE)
pp(x, transpose = FALSE)
x |
A vector, length N, which can be coerced into type |
transpose |
If |
A numeric matrix
David Clayton [email protected]
## ## Read imputed data from a file produced by MACH ## path <- system.file("extdata/mach1.out.mlprob.gz", package="snpStats") mach <- read.mach(path) pp(mach[1:50, 10])
## ## Read imputed data from a file produced by MACH ## path <- system.file("extdata/mach1.out.mlprob.gz", package="snpStats") mach <- read.mach(path) pp(mach[1:50, 10])
This function plots ranked observed chi-squared test statistics against the corresponding expected order statistics. It also estimates an inflation (or deflation) factor, lambda, by the ratio of the trimmed means of observed and expected values. This is useful for inspecting the results of whole-genome association studies for overdispersion due to population substructure and other sources of bias or confounding.
qq.chisq(x, df=1, x.max, main="QQ plot", sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), xlab="Expected", ylab="Observed", conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, slope.one=FALSE, slope.lambda=FALSE, pvals=FALSE, thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
qq.chisq(x, df=1, x.max, main="QQ plot", sub=paste("Expected distribution: chi-squared (",df," df)", sep=""), xlab="Expected", ylab="Observed", conc=c(0.025, 0.975), overdisp=FALSE, trim=0.5, slope.one=FALSE, slope.lambda=FALSE, pvals=FALSE, thin=c(0.25,50), oor.pch=24, col.shade="gray", ...)
x |
A vector of observed chi-squared test values |
df |
The degreees of freedom for the tests |
x.max |
If present, truncate the observed value (Y) axis at
|
main |
The main heading |
sub |
The subheading |
xlab |
x-axis label (default "Expected") |
ylab |
y-axis label (default "Observed") |
conc |
Lower and upper probability bounds for concentration band
for the plot. Set this to |
overdisp |
If |
trim |
Quantile point for trimmed mean calculations for estimation of lambda. Default is to trim at the median |
slope.one |
Is a line of slope one to be superimpsed? |
slope.lambda |
Is a line of slope lambda to be superimposed? |
pvals |
Are P-values to be indicated on an axis drawn on the right-hand side of the plot? |
thin |
A pair of numbers indicating how points will be thinned
before plotting (see Details).
If |
oor.pch |
Observed values greater than |
col.shade |
The colour with which the concentration band will be filled |
... |
Further graphical parameter settings to be passed to
|
To reduce plotting time and the size of plot files, the smallest
observed and expected points are thinned so that only a reduced number of
(approximately equally spaced) points are plotted. The precise
behaviour is controlled by the parameter
thin
, whose value should be a pair of numbers.
The first number must lie
between 0 and 1 and sets the proportion of the X axis over which
thinning is to be applied. The second number should be an integer and
sets the maximum number of points to be plotted in this section.
The "concentration band" for the plot is shown in grey. This region is defined by upper and lower probability bounds for each order statistic. The default is to use the 2.5 Note that this is not a simultaneous confidence region; the probability that the plot will stray outside the band at some point exceeds 95
When required, the dispersion factor is estimated by the ratio of the observed trimmed mean to its expected value under the chi-squared assumption.
The function returns the number of tests, the number of values omitted
from the plot (greater than x.max
), and the estimated
dispersion factor, lambda.
All tests must have the same number of degrees of freedom. If this is not the case, I suggest transforming to p-values and then plotting -2log(p) as chi-squared on 2 df.
David Clayton [email protected]
Devlin, B. and Roeder, K. (1999) Genomic control for association studies. Biometrics, 55:997-1004
single.snp.tests
, snp.lhs.tests
,
snp.rhs.tests
## See example the single.snp.tests() function
## See example the single.snp.tests() function
This function is purely for testing purposes. It can generate SnpMatrix objects which contain more than 2^31-1 elements.
random.snps(nrows, ncols)
random.snps(nrows, ncols)
nrows |
The number of rows to be generated |
ncols |
The number of columns to be generated |
All SNPs should be in Hardy-Weinberg equilibrium with an allele frequency of 0.5.
Note that, although the total number of elements can exceed 2^31-1, the numbers of rows and columns are still subject to this limit.
x <- random.snps(100,10) col.summary(x)
x <- random.snps(100,10) col.summary(x)
The BEAGLE program generates, for each SNP and each subject,
posterior probabilities for the three genotypes. This function reads
such data as a SnpMatrix
object, storing the posterior
probabilities to as much accuracy allowed by a one-byte coding
read.beagle(file, rownames=NULL, nsnp = NULL, header=TRUE)
read.beagle(file, rownames=NULL, nsnp = NULL, header=TRUE)
file |
The input file name. This file my be gzipped. |
rownames |
The row names (sample identifiers) for the matrix |
nsnp |
The number of SNPs to be read in. This corresponds with the number of lines in the input file. If not supplied, the function does a preliminary pass to determine the number of lines |
header |
Set this |
In later versions of BEAGLE, row names are listed on a header line.
However, if the rownames
argument is supplied, this will
take precedence over the header line. If there is no header
line and no row names are supplied, names
are generated as Sample1
, Sample2
etc.
No provision is made for data for the X chromosome. Such data must be
first read as a SnpMatrix
and subsequently coerced to an
XSnpMatrix
object
an object of class SnpMatrix
David Clayton [email protected]
##---- No example available yet
##---- No example available yet
The IMPUTE2 program generates, for each SNP and each subject,
posterior probabilities for the three genotypes. This function reads
such data as a SnpMatrix
object, storing the posterior
probabilities to as much accuracy allowed by a one-byte coding
read.impute(file, rownames = NULL, nsnp = NULL, snpcol = 2)
read.impute(file, rownames = NULL, nsnp = NULL, snpcol = 2)
file |
The input file name. This file my be gzipped. |
rownames |
The row names for the output object. Note that these correspond to
groups of three columns in the input file. If not supplied, names
are generated as |
nsnp |
The number of SNPs to be read in. This corresponds with the number of lines in the input file. If not supplied, the function does a preliminary pass to determine the number of lines |
snpcol |
Which column of the input will be used as the SNP name. Default is column 2 |
No provision is made for data for the X chromosome. Such data must be
first read as a SnpMatrix
and subsequently coerced to an
XSnpMatrix
object
an object of class SnpMatrix
David Clayton [email protected]
##---- No example available yet
##---- No example available yet
This function reads SNP genotype data from a file in which each line
refers to a single genotype call. Replaces the earlier function
read.snps.long
.
read.long(file, samples, snps, fields = c(snp = 1, sample = 2, genotype = 3, confidence = 4, allele.A = NA, allele.B = NA), split = "\t| +", gcodes, no.call = "", threshold = NULL, lex.order = FALSE, verbose = FALSE)
read.long(file, samples, snps, fields = c(snp = 1, sample = 2, genotype = 3, confidence = 4, allele.A = NA, allele.B = NA), split = "\t| +", gcodes, no.call = "", threshold = NULL, lex.order = FALSE, verbose = FALSE)
file |
Name(s) of file(s) to be read (can be gzipped) |
samples |
Either a vector of sample identifiers, or the number of samples to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all samples will be included |
snps |
Either a vector of SNP identifiers, or the number of SNPs to be read. If a single file is to be read and this argument is omitted, the file will be scanned initially and all SNPs will be included |
fields |
A named vector giving the locations of the required fields. See Details below |
split |
A regular expression specifying how the input line will be split into fields. The default value specifies separation of fields by a TAB character, or by one or more blanks |
gcodes |
When the genotype is read as a single field, this argument specifies how it is handled. See Details below. |
no.call |
The string which indicates "no call" for either a genotype or (when the genotype is read as two allele fields) an allele |
threshold |
A vector of length 2 giving the lower and higher acceptable limits for the confidence score |
lex.order |
If |
verbose |
If |
Each line on the input file represents a single call and is split into
fields using the function strsplit
. The required fields are
extracted according to the fields
argument. This must
contain the locations of the sample and snp identifier
fields and either the location of a genotype field or the
locations of two allele fields.
If the samples
and snps
arguments contain vectors of
character strings, a SnpMatrix
is created with these row and
column names and the genotype values are "cherry-picked" from the input
file. If either, or both, of these arguments are specified simply as
numbers, then these
numbers determine the dimensions of the SnpMatrix
created. In this case samples and/or SNPs are included in the
SnpMatrix
on a first-come-first-served basis. If either
or both of these arguments are omitted, a preliminary scan of the input file
is carried out to find the missing sample and/or SNP identifiers.
In this scan,
when a sample or SNP identifier differs from that in the previous
line, but is identical to one previously found, then all the relevant
identifiers are assumed to have been found. This implies that
the file must be sorted, in some consistent order,
by sample and by SNP (although either one of these may vary fastest).
If the genotype is to be read as a single field, the genotype
element of the fields
argument must be set to the appropriate
value, and the allele.A
and allele.B
elements should be
set to NA
. Its handling is controlled
by the gcodes
argument. If this is missing or NA
, then
the genotype is assumed to be represented by a two-character field,
the two characters representing the two alleles. If gcodes
is
a single string, then it is assumed to contain
a regular expression which will split the genotype field into two allele
fields. Otherwise, gcode
must be an array of length three,
specifying the three genotype codes in the order "AA", "AB", "BB".
If the two alleles of the genotype are to be read from two separate
fields, the genotype
element should be set to NA
and the
allele.A
and allele.B
elements set to the appropriate
values. The gcode
argument should be missing or set to NA
.
If the genotype is read as a single field matching one of three
specified codes, the function returns an object of class
SnpMatrix
. Otherwise it returns a list whose first element is the
SnpMatrix
object and whose second element is a dataframe
containing the allele codes, with the SNP identifiers as row names. Note
that allele codes only occur in this file if they occur in a genotype
which was accepted. Thus, monomorphic SNPs have allele.B
coded as
NA
, and SNPs which never pass confidence score filters have both
alleles coded as NA
.
Unlike read.snps.long
,
this function is written entirely in R and may not be particularly
fast. However, it imposes no restrictions on the allele codes
recognized.
Homozygous genotypes are assumed to be represented in the input file
by coding both alleles to the same value. No special provision is made
to read XSnpMatrix
objects; such data should first be read as a SnpMatrix
and then
coerced to an XSnpMatrix
using new
or as
.
David Clayton [email protected]
SnpMatrix-class
, XSnpMatrix-class
## ## No example supplied yet ##
## ## No example supplied yet ##
This routine reads imputed genotypes generated by the MACH program. With
the --mle
and --mldetails
options in force this program
generates a
.mlprob
output file which contains probabilities of
assignments. These are stored as uncertain genotype calls in a
SnpMatrix
object
read.mach(file, colnames = NULL, nrow = NULL)
read.mach(file, colnames = NULL, nrow = NULL)
file |
The name of the |
colnames |
The column names. If absent, names are generated as
|
nrow |
If known the number of rows of data on the file. If not supplied, it is determined by a preliminary pass through the data |
No routine is explicitly available for data on chromosome X. Such data
should first be read as a SnpMatrix
and then coerced to an
XSnpMatrix
object
An object of class SnpMatrix
David Clayton [email protected]
##---- No example available yet
##---- No example available yet
"SnpMatrix"
object
Reads diallelic data in linkage "pedfile" format, with one line of data per sample (subject) containing six mandatory fields followed by pairs of fields, one pair for each locus, giving the two alleles observed.
read.pedfile(file, n, snps, which, split = "\t| +", sep = ".", na.strings = "0", lex.order = FALSE)
read.pedfile(file, n, snps, which, split = "\t| +", sep = ".", na.strings = "0", lex.order = FALSE)
file |
The input pedfile. This may be (but need not be) gzipped |
n |
(Optional) The number of lines of data to be read. If not supplied the pedfile is read once and rewound to determine how many lines it contains |
snps |
(Optional) Either a character vector giving the names of the loci,
or a single character variable giving the name of a locus
information file from which these can be read. This file is assumed
to be white-space delimited with one line per locus and no header
line. If this argument is not supplied, locus names are generated as
a numerical sequence, prefixed by |
which |
(Optional) If locus names are to be read from a file, this argument should specify which column contains the names. If not supplied, the first column giving unique locus names is used |
split |
A "regexp" specifying how the input pedfile will be split into fields. The default value specifies either a TAB character or one or more spaces |
sep |
The separator character used in constructing row and column names of
the output |
na.strings |
One or more strings to be set to |
lex.order |
If TRUE, then alleles will be allocated to internal |
Row names for the output SnpMatrix
object and for the
accompanying subject description dataframe are taken as the pedigree
identifiers, when these provide the required unique identifiers. When
these are duplicated, an attempt is made to use the pedigree-member
identifiers instead but, when these too are duplicated,
row names are obtained by concatenating, with a separator character, the
pedigree and pedigree-member identifiers.
A list, comprising
genotypes |
The output genotype data as an object of class
|
fam |
A dataframe containing the first six fields in the
pedfile. The row names will correspond with those of the |
map |
A dataframe giving the alleles at each locus. If locus names
were obtained from a dataframe read from an existing file, then the
allele information is simply appended to this frame. Otherwise a new
dataframe is created. The row names will correspond with the column
names of the |
This function is written entirely in R and may not be particularly fast. However, it imposes no restrictions on the allele codes recognized.
Homozygous genotypes may be represented in the input file either (a)
by coding both alleles to the same value, or (b) setting the second
allele to "missing" (as specified by the missing.allele
argument). No special provision is made to read XSnpMatrix
objects; such data should first be read as a SnpMatrix
and then
coerced to an XSnpMatrix
using new
or as
.
David Clayton [email protected]
SnpMatrix-class
, XSnpMatrix-class
## ## No example supplied yet ##
## ## No example supplied yet ##
The package PLINK saves genome-wide association data in groups of three
files, with the extensions .bed
, .bim
, and
.fam
. This function reads these files and creates an object of
class "SnpMatrix"
read.plink(bed, bim, fam, na.strings = c("0", "-9"), sep = "." , select.subjects = NULL, select.snps = NULL)
read.plink(bed, bim, fam, na.strings = c("0", "-9"), sep = "." , select.subjects = NULL, select.snps = NULL)
bed |
The name of the
file containing the packed binary SNP genotype data. It should have
the extension |
bim |
The file containing the SNP descriptions |
fam |
The file containing subject (and, possibly, family) identifiers. This is basically a tab-delimited "pedfile" |
na.strings |
Strings in .bam and .fam files to be recoded as NA |
sep |
A separator character for constructing unique subject identifiers |
select.subjects |
A numeric vector indicating a subset of subjects to be selected from the input file (see details) |
select.snps |
Either a numeric or a character vector indicating a subset of SNPs to be selected from the input file (see details) |
If the bed
argument does not contain a filename with the file
extension .bed
, then this extension is appended to the
argument. The remaining two arguments are optional; their default
values are obtained by replacing the .bed
filename extension by
.bim
and .fam
respectively. See the PLINK documentation
for the detailed specification of these files.
The select.subjects
or select.snps
argument can be used
to read a subset of the data. Use of select.snps
requires that
the .bed
file is in SNP-major order (the default in
PLINK). Likewise, use of select.snps
requires that
the .bed
file is in individual-major order. Subjects are
selected by their numeric order in the PLINK files, while SNPs are
selected either by order or by name. Note that
the order of selected SNPs/subjects in the output objects
will be the same as
their order in the PLINK files.
Row names for the output SnpMatrix
object and for the
accompanying subject description dataframe are taken as the pedigree
identifiers, when these provide the required unique identifiers. When
these are duplicated, an attempt is made to use the pedigree-member
identifiers instead but, when these too are duplicated,
row names are obtained by concatenating, with a separator character, the
pedigree and pedigree-member identifiers.
A list with three elements:
genotypes |
The output genotype data as an object of class |
"SnpMatrix"
.
fam |
A dataframe corresponding to the |
map |
A dataframe correponding to the |
No special provision is made to read XSnpMatrix
objects; such data should first be read as a SnpMatrix
and then
coerced to an XSnpMatrix
using new
or as
.
David Clayton [email protected]
PLINK: Whole genome association analysis toolset. http://pngu.mgh.harvard.edu/~purcell/plink/
write.plink
,
SnpMatrix-class
, XSnpMatrix-class
Reads SNP data when organized in free format
as one call per line. Other than the one
call per line requirement, there is considerable flexibility. Multiple
input files can be read, the input fields can be in any order on the
line, and irrelevant fields can be skipped. The samples and SNPs
to be read must be pre-specified, and define rows and columns of an
output object of class "SnpMatrix"
. This function has been
replaced in versions 1.3 and later by the more flexible function
read.long
.
read.snps.long(files, sample.id = NULL, snp.id = NULL, diploid = NULL, fields = c(sample = 1, snp = 2, genotype = 3, confidence = 4), codes = c("0", "1", "2"), threshold = 0.9, lower = TRUE, sep = " ", comment = "#", skip = 0, simplify = c(FALSE,FALSE), verbose = FALSE, in.order=TRUE, every = 1000)
read.snps.long(files, sample.id = NULL, snp.id = NULL, diploid = NULL, fields = c(sample = 1, snp = 2, genotype = 3, confidence = 4), codes = c("0", "1", "2"), threshold = 0.9, lower = TRUE, sep = " ", comment = "#", skip = 0, simplify = c(FALSE,FALSE), verbose = FALSE, in.order=TRUE, every = 1000)
files |
A character vector giving the names of the input files |
sample.id |
A character vector giving the identifiers of the samples to be read |
snp.id |
A character vector giving the names of the SNPs to be read |
diploid |
A logical array of the same length as
|
fields |
A integer vector with named elements specifying the
positions of the required fields in the input record. The fields are
identified by the names |
codes |
Either the single string |
threshold |
A numerical value for the calling threshold on the confidence score |
lower |
If |
sep |
The delimiting character separating fields in the input record |
comment |
A character denoting that any remaining input on a line is to be ignored |
skip |
An integer value specifying how many lines are to be skipped at the beginning of each data file |
simplify |
If |
verbose |
If |
in.order |
If |
every |
See |
If nucleotide coding is not used, the codes
argument
should be a character array giving the valid codes.
For genotype coding of autosomal SNPs, this should be
an array of length 3 giving the codes
for the three genotypes, in the order homozygous(AA), heterozygous(AB),
homozygous(BB). All other codes will be treated
as "no call". The default codes are "0"
, "1"
,
"2"
. For X SNPs, males are assumed to be coded as homozygous,
unless an additional two codes are supplied (representing the
AY and BY genotypes). For allele coding, the
codes
array should be of length 2 and should specify the codes
for the two alleles. Again, any other code is treated as
"missing" and, for X SNPs, males should be coded either as
homozygous or by omission of the second allele.
For nucleotide coding, nucleotides are assigned to the nominal alleles in alphabetic order. Thus, for a SNP with either "T" and "A" nucleotides in the variant position, the nominal genotypes AA, AB and BB will refer to A/A, A/T and T/T.
Although the function allows for reading into an object of class
XSnpMatrix
directly,
it is usually preferable to read such data as a "SnpMatrix"
(i.e. as autosomal) and to coerce it to an object of type
"XSnpMatrix"
later using as(..., "X.SnpMatrix")
or
new("XSnpMatrix", ..., diploid=...)
. If diploid
is coded NA
for any subject the latter course must be
followed, since NA
s are not accepted in the diploid
argument.
If the in.order
argument is set TRUE
, then
the vectors sample.id
and snp.id
must be in the same
order as they vary on the input file(s) and this ordering must be
consistent. However, there is
no requirement that either SNP or sample should vary fastest as this is
detected from the input. If in.order
is FALSE
, then no
assumptions about the ordering of the input file are assumed and SNP
and sample identifiers are looked up in hash tables as they are
read. This option must be expected, therefore, to be somewhat slower.
Each file may represent a separate sample or SNP, in which case the
appropriate .id
argument can be omitted; row or column names
are then taken from the file names.
An object of class "SnpMatrix"
or "XSnpMatrix"
.
The function will read gzipped files.
If in.order
is TRUE
,
every combination of sample and snp listed in the
sample.id
and snp.id
arguments must be present in the
input file(s). Otherwise the function will search for any missing
observation until reaching the end of the data, ignoring everything
else on the way.
David Clayton [email protected]
read.plink
,
SnpMatrix-class
, XSnpMatrix-class
These function calculates summary statistics of each row or column of
call rates and heterozygosity for each row of a an
object of class "SnpMatrix"
or "XSnpMatrix"
row.summary(object) col.summary(object, rules = NULL, uncertain = TRUE)
row.summary(object) col.summary(object, rules = NULL, uncertain = TRUE)
object |
genotype data as a |
rules |
An object of class
|
uncertain |
If |
row.summary |
returns a data frame with rows corresponding to rows of the input object and with columns/elements:
Uncertain calls are ignored for calculating the heterozygosity. |
col.summary |
returns a data frame with rows corresponding to columns of the input object and with columns/elements:
For objects of class
|
The current version of row.summary
does not deal with the X chromosome
differently, so that males are counted as homozygous.
David Clayton [email protected]
data(testdata) rs <- row.summary(Autosomes) summary(rs) cs <- col.summary(Autosomes) summary(cs) cs <- col.summary(Xchromosome) summary(cs)
data(testdata) rs <- row.summary(Autosomes) summary(rs) cs <- col.summary(Autosomes) summary(cs) cs <- col.summary(Xchromosome) summary(cs)
The first five files concern data on 20 diallelic loci on 120 subjects. These data are distributed with the Haploview package (Barrett et al., 2003). The sixth files contains a additional dataset of 18 SNPs in 100 subjects, coded in "long" format, and the seventh file duplicates this dataset in an alternative long format. These seven files are used in the data input vignette. The final file is a sample imputed genotype dataset distributed with the MACH imputation package, and used in the imputation vignette.
These files are stored in the extdata
relative to the
package base. Full file names can be obtained using the
system.file
function.
The following files are described here:
sample.ped.gz
: A gzipped pedfile
sample.info
: An accompanying locus information file
sample.bed
: The corresponding PLINK .bed
file
sample.bim
: The PLINK .bim
file
sample.fam
: The PLINK .fam
file
sample-long.gz
: A sample of long-formatted data
sample-long-alleles.gz
: The same as above, but allele-coded
mach1.out.mlprob.gz
: An mlprob
output file from
the MACH genotype imputation program. This file contains, for each
imputed genotype call, posterior probabilities for the three possible
genotypes
http://www.broadinstitute.org/scientific-community/science/programs/medical-and-population-genetics/haploview/downloads http://www.sph.umich.edu/csg/abecasis/MACH/download
Barrett JC, Fry B, Maller J, Daly MJ.(2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 2005 Jan 15, [PubMed ID: 15297300]
This function carries out tests for association between phenotype and a series of single nucleotide polymorphisms (SNPs), within strata defined by a possibly confounding factor. SNPs are considered one at a time and both 1-df and 2-df tests are calculated. For a binary phenotype, the 1-df test is the Cochran-Armitage test (or, when stratified, the Mantel-extension test). The function will also calculate the same tests for SNPs imputed by regression analysis.
single.snp.tests(phenotype, stratum, data = sys.parent(), snp.data, rules=NULL, subset, snp.subset, uncertain = FALSE, score=FALSE)
single.snp.tests(phenotype, stratum, data = sys.parent(), snp.data, rules=NULL, subset, snp.subset, uncertain = FALSE, score=FALSE)
phenotype |
A vector containing the values of the phenotype |
stratum |
Optionally, a factor defining strata for the analysis |
data |
A dataframe containing the |
snp.data |
An object of class |
rules |
An object of class
|
subset |
A vector or expression describing the subset of subjects
to be used in the analysis. This is evaluated in the same
environment as the |
snp.subset |
A vector describing the subset of SNPs to be
considered. Default action is to test all SNPs in |
uncertain |
If TRUE, uncertain genotypes are handled by replacing
score contributions by their posterior expectations. Otherwise they
are treated as missing. Setting this option authomatically invokes
use of |
score |
If |
Formally, the test statistics are score tests for generalized linear models with canonical link. That is, they are inner products between genotype indicators and the deviations of phenotypes from their stratum means. Variances (and covariances) are those of the permutation distribution obtained by randomly permuting phenotype within stratum.
When the function is used to calculate tests for imputed SNPs, the test is still a score test. The score statistics are calculated from the expected value, given observed SNPs, of the score statistic if the SNP to be tested were itself observed.
The subset
argument can either be a logical vector of length
equal to the length of the vector of phenotypes, an integer vector
specifying positions in the data
frame, or a character vector
containing names of the selected rows in the data
frame. Similarly, the snp.subset
argument can be a logical,
integer, or character vector.
An object of class
"SingleSnpTests"
.
If score
is set to TRUE
,
the output object will be of the extended class
"SingleSnpTestsScore"
containing additional slots holding the score statistics and their
variances (and covariances). This allows meta-analysis using the
pool
function.
The 1 df imputation tests are described by Chapman et al. (2008)
and the 2 df imputation tests are a simple extension of these.
The behaviour of this function for objects of class
XSnpMatrix
is as described by Clayton (2008). Males are
treated as homozygous females and corrected variance estimates are
used.
David Clayton [email protected]
Chapman J.M., Cooper J.D., Todd J.A. and Clayton D.G. (2003)
Human Heredity, 56:18-31.
Clayton (2008) Testing for association on the X chromosome
Biostatistics, 9:593-600.)
snp.lhs.tests
, snp.rhs.tests
,
impute.snps
, ImputationRules-class
,
pool
,
SingleSnpTests-class
,
SingleSnpTestsScore-class
data(testdata) results <- single.snp.tests(cc, stratum=region, data=subject.data, snp.data=Autosomes, snp.subset=1:10) print(summary(results)) # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, as(results, 'data.frame')) unlink(csvfile) # QQ plot qq.chisq(chi.squared(results, 1), 1) qq.chisq(chi.squared(results, 2), 2)
data(testdata) results <- single.snp.tests(cc, stratum=region, data=subject.data, snp.data=Autosomes, snp.subset=1:10) print(summary(results)) # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, as(results, 'data.frame')) unlink(csvfile) # QQ plot qq.chisq(chi.squared(results, 1), 1) qq.chisq(chi.squared(results, 2), 2)
These are classes to hold the objects created by
single.snp.tests
and provide methods for extracting key
elements. The class "SingleSnpTestsScore"
extends class
"SingleSnpTests"
to include the score and score variance statistics in order to
provide methods for pooling results from several studies or parts of a study
Objects can be created by calls of the form
new("SingleSnpTests", ...)
and
new("SingleSnpTestsScore", ...)
but, more usually, will be created
by calls to single.snp.tests
snp.names
:The names of the SNPs tested, as they
appear as column names in the original SnpMatrix
chisq
:A two-column matrix holding the 1 and 2 df association tests
N
:The numbers of observations included in each test
N.r2
:For tests on imputed SNPs, the product of N
and
the imputation . Otherwise a zero-length object
U
:(class "SingleSnpTestsScore"
) Score statistics
V
:(class "SingleSnpTestsScore"
) Score variances
signature(x = "SingleSnpTests", i = "ANY")
:
Subsetting operator
signature(x = "SingleSnpTestsScore", i = "ANY")
:
Subsetting operator
signature(x = "SingleSnpTests", df =
"numeric")
:
Extract 1- and 2-df chi-squared test values
signature(x = "SingleSnpTestsScore", simplify =
"missing")
: Extract signs of associations tested by the 1df tests
signature(x="SingleSnpTests")
: Extract names of
test values (snp.names
slot)
signature(x = "SingleSnpTests", df =
"numeric")
:
Evaluate 1- and 2-df test p-values
signature(object = "SingleSnpTests")
:
List all tests and p-values
signature(from = "SingleSnpTests", to = "data.frame")
:
Conversion to data frame class
signature(object = "SingleSnpTests")
:
Extract sample sizes for tests
signature(object = "SingleSnpTests")
:
Extract effective sample sizes for tests. For imputed tests, these
are the real sample sizes multiplied by the corresponding
R-squared values for imputation
signature(object = "SingleSnpTests")
:
Summarize all tests and p-values
signature(x = "SingleSnpTestsScore",
y = "SingleSnpTestsScore", score = "logical")
: Combine two
sets of test results. Used recursively by pool
signature(x = "SingleSnpTestsScore", snps =
"ANY")
: Emulate, in the score vector and its (co)variances,
the effect of switching of the alleles for the specified tests
David Clayton [email protected]
showClass("SingleSnpTests") showClass("SingleSnpTestsScore")
showClass("SingleSnpTests") showClass("SingleSnpTestsScore")
For quality control purposes, it is sometimes necessary to compare genotype data derived from different sources. This function facilitates this.
sm.compare(obj1, obj2, row.wise = TRUE, col.wise = TRUE)
sm.compare(obj1, obj2, row.wise = TRUE, col.wise = TRUE)
obj1 |
The first of the two |
obj2 |
The second |
row.wise |
Calculate comparison statistics aggregated in a row-wise manner |
col.wise |
Calculate column-wise comparison statistics |
Initially row and column names of the two objects are compared to identify subsets of subjects and SNPs which they have in common. Then, every instance of a SNP genotype in the two objects are compared and agreements and disagreements counted by row and/or by column.
If only one of the row-wise and column-wise summaries are to be calculated, the return value is a matrix with rows defined by subjects or SNPs and columns giving counts of:
Agree |
Agreements (all) |
Disagree |
Disgreements (all) |
NA.agree |
Genotype coded |
NA.disagree |
Genotype coded |
Hom.agree |
Objects agree and genotype is homozygous |
Hom.switch |
Genotype coded as homozygous in both objects, but alleles switched |
Het.agree |
Genotype coded as heterozygous in both objects |
Het.Hom |
Genotype coded as heterozygous in one object and homozygous in the other |
If both row-wise and column-wise summaries are computed (the default
behaviour) , the function
returns a list containing two matrices of the form described
above. These are named row.wise
and col.wise
No special provision is yet made for objects of class
XSnpMatrix
, in which haploid calls are coded as homozygous.
David Clayton [email protected]
SnpMatrix-class
, XSnpMatrix-class
## ## No example yet available ##
## ## No example yet available ##
This function calculates Pearson correlation coefficients between columns of a
SnpMatrix
and columns of an ordinary matrix. The two matrices
must have the same number of rows. All valid pairs are used in the
computation of each correlation coefficient.
snp.cor(x, y, uncertain = FALSE)
snp.cor(x, y, uncertain = FALSE)
x |
An N by M |
y |
An N by P general matrix |
uncertain |
If |
This can be used together with xxt
and
eigen
to calculate standardized loadings in the principal
components
An M by P matrix of correlation coefficients
This version cannot handle X chromosomes
David Clayton [email protected]
# make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:100,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=TRUE) # Calculate the principal components pc <- eigen(xx, symmetric=TRUE)$vectors # Calculate the loadings in first 10 components */ loadings <- snp.cor(small, pc[,1:10])
# make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:100,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=TRUE) # Calculate the principal components pc <- eigen(xx, symmetric=TRUE)$vectors # Calculate the loadings in first 10 components */ loadings <- snp.cor(small, pc[,1:10])
Given two set of SNPs typed in the same subjects, this function calculates rules which can be used to impute one set from the other in a subsequent sample. The function can also calculate rules for imputing each SNP in a single dataset from other SNPs in the same dataset
snp.imputation(X, Y, pos.X, pos.Y, phase=FALSE, try=50, stopping=c(0.95, 4, 0.05), use.hap=c(1.0, 0.0), em.cntrl=c(50,0.01,10,0.01), minA=5)
snp.imputation(X, Y, pos.X, pos.Y, phase=FALSE, try=50, stopping=c(0.95, 4, 0.05), use.hap=c(1.0, 0.0), em.cntrl=c(50,0.01,10,0.01), minA=5)
X |
An object of class |
Y |
An object of same class as |
pos.X |
The positions of the predictor SNPs. Can be missing if
there is no |
pos.Y |
The positions of the target SNPs. Only required when
a |
phase |
See "Details" below |
try |
The number of potential predictor SNPs to be
considered in the stepwise regression procedure around each target
SNP . The nearest |
stopping |
Parameters of the stopping rule for the stepwise regression (see below) |
use.hap |
Parameters to control use of the haplotype imputation method (see below) |
em.cntrl |
Parameters to control test for convergence of EM algorithm for fitting phased haplotypes (see below) |
minA |
A minimum data quantity measure for estimating pairwise linkage disequilibrium (see below) |
The routine first carries out a series of step-wise least-square
regression analyses in
which each Y SNP is regressed on the nearest try
predictor (X)
SNPs. If
phase
is TRUE
, the regressions will be calculated at the
chromosome (haplotype) level, variances being simply and
covariances estimated from the estimated two-locus haplotypes (this option is
not yet implemented). Otherwise, the
analysis is carried out at the genotype level based on
conventional variance and covariance estimates using the
"pairwise.complete.obs"
missing value treatment
(see cov
). New
SNPs are added to the regression until either (a) the value of
exceeds the first parameter of
stopping
, (b) the
number of "tag" SNPs has reached the maximum set in the second parameter of
stopping
, or (c) the change in does not achieve the
target set by the third parameter of
stopping
. If the third
parameter of stopping
is NA
, this last test is replaced
by a test for improvement in the Akaike information criterion
(AIC).
After choosing the set of "tag" SNPs in this way, a prediction
rule is generated either by calculating phased haplotype frequencies,
either (a) under a log-linear model for linkage disequilibrium with
only first order association terms fitted, or (b) under the
"saturated" model.
These methods do not differ if there is only
one tag SNP but, otherwise, choice between methods is controlled
by the use.hap
parameters.
If the prediction, as measure by achieved with the
log-linear smoothing model exceeds a
threshold (the first parameter of
use.hap
)
then this method is used. Otherwise, if the gain in
achieved by using the second method exceeds the second parameter of
use.hap
, then the second method is used.
Current experience is that, the log-linear method is rarely
preferred with reasonable choices for use.hap
, and imputation
is much faster when the second method only is considered.
The current default ensures that this second method is used,
but the other possibility might be considered if imputing
from very small samples; however this code is not extensively tested
and should be regarded as experimental.
The argument em.cntrl
controls convergence
testing for the EM algorithm for fitting haplotype frequencies and the
IPF algorithm for fitting the log-linear model. The
first parameter is the maximum number of EM iterations, and the second
parameter is the threshold for the change in log likelihood
below which the iteration is judged to have converged. The third and
fourth parameters give the maximum number of IPF iterations and the
convergence tolerance. There should be no need to change the default
values.
All SNPs selected for imputation must have sufficient data for
estimating pairwise linkage disequilibrium with each other and with
the target SNP. The statistic chosen is based on the four-fold tables
of two-locus haplotype frequencies. If the frequencies in such a table
are labelled and
then, if
then
and, otherwise,
. The cell
frequencies
must exceed
minA
for all pairwise
comparisons.
An object of class
"ImputationRules"
.
The phase=TRUE
option is not yet implemented
David Clayton [email protected]
Chapman J.M., Cooper J.D., Todd J.A. and Clayton D.G. (2003) Human Heredity, 56:18-31.
Wallace, C. et al. (2010) Nature Genetics, 42:68-71
ImputationRules-class
,
imputation.maf
, imputation.r2
# Remove 5 SNPs from a datset and derive imputation rules for them data(for.exercise) sel <- c(20, 1000, 2000, 3000, 5000) to.impute <- snps.10[,sel] impute.from <- snps.10[,-sel] pos.to <- snp.support$position[sel] pos.fr <- snp.support$position[-sel] imp <- snp.imputation(impute.from, to.impute, pos.fr, pos.to)
# Remove 5 SNPs from a datset and derive imputation rules for them data(for.exercise) sel <- c(20, 1000, 2000, 3000, 5000) to.impute <- snps.10[,sel] impute.from <- snps.10[,-sel] pos.to <- snp.support$position[sel] pos.fr <- snp.support$position[-sel] imp <- snp.imputation(impute.from, to.impute, pos.fr, pos.to)
Under the assumption of Hardy-Weinberg equilibrium, a SNP genotype is
a binomial variate with two trials for an autosomal SNP or with one or
two trials (depending on sex) for a SNP on the X chromosome.
With each SNP in an input
"SnpMatrix"
as dependent variable, this function fits a
logistic regression model. The Hardy-Weinberg
assumption can be relaxed by use of a "robust" option.
snp.lhs.estimates(snp.data, base.formula, add.formula, subset, snp.subset, data = sys.parent(), robust = FALSE, uncertain = FALSE, control=glm.test.control())
snp.lhs.estimates(snp.data, base.formula, add.formula, subset, snp.subset, data = sys.parent(), robust = FALSE, uncertain = FALSE, control=glm.test.control())
snp.data |
The SNP data, as an object of class
|
base.formula |
A |
add.formula |
A |
subset |
An array describing the subset of observations to be considered |
snp.subset |
An array describing the subset of SNPs to be considered. Default action is to test all SNPs. |
data |
The data frame in which |
robust |
If |
uncertain |
If |
control |
An object giving parameters for the IRLS algorithm
fitting of the base model and for the acceptable aliasing amongst
new terms to be tested. See |
The model fitted is the union of the base.formula
and
add.formula
models, although parameter estimates (and their
variance-covariance matrix) are only
generated for the parameters of the latter.
The "robust" option causes a Huber-White "sandwich" estimate of the
variance-covariance matrix to be used in place of the usual inverse
second derivative matrix of the log-likelihood (which assumes
Hardy-Weinberg equilibrium).
If a data
argument is supplied, the snp.data
and
data
objects are aligned by rowname. Otherwise all variables in
the model formulae are assumed to be stored in the same order as the
columns of the snp.data
object.
An object of class GlmEstimates
A factor (or
several factors) may be included as arguments to the function
strata(...)
in the base.formula
. This fits all
interactions of the factors so included, but leads to faster
computation than fitting these in the normal way. Additionally, a
cluster(...)
call may be included in the base model
formula. This identifies clusters of potentially correlated
observations (e.g. for members of the same family); in this case, an
appropriate robust estimate of the variance-covariance matrix of
parameter estimates is calculated. No more than one
strata()
call may be used, and neither strata(...)
or
cluster(...)
calls may appear in the add.formula
.
If uncertain genotypes (e.g. as a result of imputation) are used, the interpretation of the regression coefficients is questionable.
A known bug is that the function fails when no data
argument is
supplied and the base model formula contains no variables
(~1
). A work-round is to create a data frame to hold the
variables in the models and pass this as data=
.
David Clayton [email protected]
GlmEstimates-class
, snp.lhs.tests
data(testdata) test1 <- snp.lhs.estimates(Autosomes[,1:10], ~cc, ~region, data=subject.data) test2 <- snp.lhs.estimates(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data) test3 <- snp.lhs.estimates(Autosomes[,1:10], ~cc, ~region, data=subject.data, robust=TRUE) test4 <- snp.lhs.estimates(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data, robust=TRUE) test5 <- snp.lhs.estimates(Autosomes[,1:10], ~region+sex, ~cc, data=subject.data, robust=TRUE) print(test1) print(test2) print(test3) print(test4) print(test5)
data(testdata) test1 <- snp.lhs.estimates(Autosomes[,1:10], ~cc, ~region, data=subject.data) test2 <- snp.lhs.estimates(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data) test3 <- snp.lhs.estimates(Autosomes[,1:10], ~cc, ~region, data=subject.data, robust=TRUE) test4 <- snp.lhs.estimates(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data, robust=TRUE) test5 <- snp.lhs.estimates(Autosomes[,1:10], ~region+sex, ~cc, data=subject.data, robust=TRUE) print(test1) print(test2) print(test3) print(test4) print(test5)
Under the assumption of Hardy-Weinberg equilibrium, a SNP genotype is
a binomial variate with two trials for an autosomal SNP or with one or
two trials (depending on sex) for a SNP on the X chromosome.
With each SNP in an input
"SnpMatrix"
as dependent variable, this function first fits a
"base" logistic regression model and then carries out a score test for
the addition of further term(s). The Hardy-Weinberg
assumption can be relaxed by use of a "robust" option.
snp.lhs.tests(snp.data, base.formula, add.formula, subset, snp.subset, data = sys.parent(), robust = FALSE, uncertain = FALSE, control=glm.test.control(), score=FALSE)
snp.lhs.tests(snp.data, base.formula, add.formula, subset, snp.subset, data = sys.parent(), robust = FALSE, uncertain = FALSE, control=glm.test.control(), score=FALSE)
snp.data |
The SNP data, as an object of class
|
base.formula |
A |
add.formula |
A |
subset |
An array describing the subset of observations to be considered |
snp.subset |
An array describing the subset of SNPs to be considered. Default action is to test all SNPs. |
data |
The data frame in which |
robust |
If |
uncertain |
If |
control |
An object giving parameters for the IRLS algorithm
fitting of the base model and for the acceptable aliasing amongst
new terms to be tested. See |
score |
Is extended score information to be returned? |
The tests used are asymptotic chi-squared tests based on the vector of
first and second derivatives of the log-likelihood with respect to the
parameters of the additional model. The "robust" form is a generalized
score test in the sense discussed by Boos(1992).
If a data
argument is supplied, the snp.data
and
data
objects are aligned by rowname. Otherwise all variables in
the model formulae are assumed to be stored in the same order as the
columns of the snp.data
object.
An object of class snp.tests.glm
or GlmTests.score
depending on whether score
is set to FALSE
or TRUE
in the call.
A factor (or
several factors) may be included as arguments to the function
strata(...)
in the base.formula
. This fits all
interactions of the factors so included, but leads to faster
computation than fitting these in the normal way. Additionally, a
cluster(...)
call may be included in the base model
formula. This identifies clusters of potentially correlated
observations (e.g. for members of the same family); in this case, an
appropriate robust estimate of the variance of the score test is used.
No more than one
strata()
call may be used, and neither strata(...)
or
cluster(...)
calls may appear in the add.formula
.
A known bug is that the function fails when no data
argument is
supplied and the base model formula contains no variables
(~1
). A work-round is to create a data frame to hold the
variables in the models and pass this as data=
.
David Clayton [email protected]
Boos, Dennis D. (1992) On generalized score tests. The American Statistician, 46:327-333.
GlmTests-class
,
GlmTestsScore-class
,
glm.test.control
,snp.rhs.tests
single.snp.tests
, SnpMatrix-class
,
XSnpMatrix-class
data(testdata) snp.lhs.tests(Autosomes[,1:10], ~cc, ~region, data=subject.data) snp.lhs.tests(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data)
data(testdata) snp.lhs.tests(Autosomes[,1:10], ~cc, ~region, data=subject.data) snp.lhs.tests(Autosomes[,1:10], ~strata(region), ~cc, data=subject.data)
These functions first standardize the input SnpMatrix
in the
same way as does the function xxt
. The standardized
matrix is then either pre-multiplied (snp.pre.multiply
) or
post-multiplied (snp.post.multiply
) by a general matrix. Allele
frequencies for standardizing the input SnpMatrix may be supplied
but, otherwise, are calculated from the input SnpMatrix
snp.pre.multiply(snps, mat, frequency=NULL, uncertain = FALSE) snp.post.multiply(snps, mat, frequency=NULL, uncertain = FALSE)
snp.pre.multiply(snps, mat, frequency=NULL, uncertain = FALSE) snp.post.multiply(snps, mat, frequency=NULL, uncertain = FALSE)
snps |
An object of class |
mat |
A general (numeric) matrix |
frequency |
A numeric vector giving the allele (relative)
frequencies to be used for standardizing the columns of |
uncertain |
If |
The two matrices must be conformant, as with standard matrix multiplication. The main use envisaged for these functions is the calculation of factor loadings in principal component analyses of large scale SNP data, and the application of these loadings to other datasets. The use of externally supplied allele frequencies for standardizing the input SnpMatrix is required when applying loadings calculated from one dataset to a different dataset
The resulting matrix product
David Clayton [email protected]
##-- ##-- Calculate first two principal components and their loading, and verify ##-- # Make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:20,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=FALSE) # Calculate the first two principal components and corresponding eigenvalues eigvv <- eigen(xx, symmetric=TRUE) pc <- eigvv$vectors[,1:2] ev <- eigvv$values[1:2] # Calculate loadings for first two principal components Dinv <- diag(1/sqrt(ev)) loadings <- snp.pre.multiply(small, Dinv %*% t(pc)) # Now apply loadings back to recalculate the principal components pc.again <- snp.post.multiply(small, t(loadings) %*% Dinv) print(cbind(pc, pc.again))
##-- ##-- Calculate first two principal components and their loading, and verify ##-- # Make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:20,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=FALSE) # Calculate the first two principal components and corresponding eigenvalues eigvv <- eigen(xx, symmetric=TRUE) pc <- eigvv$vectors[,1:2] ev <- eigvv$values[1:2] # Calculate loadings for first two principal components Dinv <- diag(1/sqrt(ev)) loadings <- snp.pre.multiply(small, Dinv %*% t(pc)) # Now apply loadings back to recalculate the principal components pc.again <- snp.post.multiply(small, t(loadings) %*% Dinv) print(cbind(pc, pc.again))
This function fits a generalized linear model with phenotype as dependent variable and with a series of SNPs (or small sets of SNPs) as predictor variables. Optionally, one or more potential confounders of a phenotype-genotype association may be included in the model. In order to protect against misspecification of the variance function, "robust" estimates of the variance-covariance matrix of estimates may be calculated in place of the usual model-based estimates.
snp.rhs.estimates(formula, family = "binomial", link, weights, subset, data = parent.frame(), snp.data, rules = NULL, sets = NULL, robust = FALSE, uncertain = FALSE, control = glm.test.control())
snp.rhs.estimates(formula, family = "binomial", link, weights, subset, data = parent.frame(), snp.data, rules = NULL, sets = NULL, robust = FALSE, uncertain = FALSE, control = glm.test.control())
formula |
The model formula, with phenotype as dependent variable and any potential confounders as independent variables. Note that parameter estimates are not returned for these model terms |
family |
A string defining the generalized linear model
family. This currently should (partially) match one of
|
link |
A string defining the link function for the GLM. This
currently should (partially) match one of |
data |
The dataframe in which the model formula is to be interpreted |
snp.data |
An object of class |
rules |
Optionally, an object of class |
sets |
Either a vector of SNP names (or numbers) for the SNPs
to be added to the model formula, or a logical vector of length
equal to the number of columns in |
weights |
"Prior" weights in the generalized linear model |
subset |
Array defining the subset of rows of |
robust |
If |
uncertain |
If |
control |
An object giving parameters for the IRLS algorithm
fitting of the base model and for the acceptable aliasing amongst
new terms to be tested. See |
Homozygous SNP genotypes are coded 0 or 2 and heterozygous genotypes are coded 1. For SNPs on the X chromosome, males are coded as homozygous females. For X SNPs, it will often be appropriate to include sex of subject in the base model (this is not done automatically). The "robust" option causes Huber-White estimates of the variance-covariance matrix of the parameter estimates to be returned. These protect against mis-specification of the variance function in the GLM, for example if binary or count data are overdispersed,
If a data
argument is supplied, the snp.data
and
data
objects are aligned by rowname. Otherwise all variables in
the model formulae are assumed to be stored in the same order as the
columns of the snp.data
object.
Usually SNPs to be fitted in models will be referenced by name. However,
they can
also be referenced by number, indicating the
appropriate column in the input snp.data
. They can also
be referenced by a logical selection vector of length equal to the
number of columns in snp.data
.
If the rules
argument is supplied, SNPs may be imputed using
these rules and included in the model.
An object of class GlmEstimates
A factor (or
several factors) may be included as arguments to the function
strata(...)
in the formula
. This fits all
interactions of the factors so included, but leads to faster
computation than fitting these in the normal way. Additionally, a
cluster(...)
call may be included in the base model
formula. This identifies clusters of potentially correlated
observations (e.g. for members of the same family); in this case, an
appropriate robust estimate of the variance of the parameter
estimates is used.
If uncertain genotypes (e.g. as a result of imputation) are used, the interpretation of the regression coefficients is questionable; the regression coefficient for an imperfectly measurement of a variable is not a biased (attenuated) estimate of the coefficient of the variable measured.
David Clayton [email protected]
GlmEstimates-class
,
snp.lhs.estimates
,
snp.rhs.tests
,
SnpMatrix-class
, XSnpMatrix-class
data(testdata) test <- snp.rhs.estimates(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10) print(test) test2 <- snp.rhs.estimates(cc~region+sex, family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10) print(test2) test.robust <- snp.rhs.estimates(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10, robust=TRUE) print(test.robust)
data(testdata) test <- snp.rhs.estimates(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10) print(test) test2 <- snp.rhs.estimates(cc~region+sex, family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10) print(test2) test.robust <- snp.rhs.estimates(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, sets=1:10, robust=TRUE) print(test.robust)
This function fits a generalized linear model with phenotype as dependent variable and, optionally, one or more potential confounders of a phenotype-genotype association as independent variable. A series of SNPs (or small groups of SNPs) are then tested for additional association with phenotype. In order to protect against misspecification of the variance function, "robust" tests may be selected.
snp.rhs.tests(formula, family = "binomial", link, weights, subset, data = parent.frame(), snp.data, rules=NULL, tests=NULL, robust = FALSE, uncertain=FALSE, control=glm.test.control(), allow.missing=0.01, score=FALSE)
snp.rhs.tests(formula, family = "binomial", link, weights, subset, data = parent.frame(), snp.data, rules=NULL, tests=NULL, robust = FALSE, uncertain=FALSE, control=glm.test.control(), allow.missing=0.01, score=FALSE)
formula |
The base model formula, with phenotype as dependent variable |
family |
A string defining the generalized linear model
family. This currently should (partially) match one of
|
link |
A string defining the link function for the GLM. This
currently should (partially) match one of |
data |
The dataframe in which the base model is to be fitted |
snp.data |
An object of class |
rules |
An object of class
|
tests |
Either a vector of SNP names (or numbers) for the SNPs
to be tested, or a logical vector of length equal to the number of
columns in |
weights |
"Prior" weights in the generalized linear model |
subset |
Array defining the subset of rows of |
robust |
If |
uncertain |
If |
control |
An object giving parameters for the IRLS algorithm
fitting of the base model and for the acceptable aliasing amongst
new terms to be tested. See |
allow.missing |
The maximum proportion of SNP genotype that can be missing before it becomes necessary to refit the base model |
score |
Is extended score information to be returned? |
The tests used are asymptotic chi-squared tests based on the vector of first and second derivatives of the log-likelihood with respect to the parameters of the additional model. The "robust" form is a generalized score test in the sense discussed by Boos(1992). The "base" model is first fitted, and a score test is performed for addition of one or more SNP genotypes to the model. Homozygous SNP genotypes are coded 0 or 2 and heterozygous genotypes are coded 1. For SNPs on the X chromosome, males are coded as homozygous females. For X SNPs, it will often be appropriate to include sex of subject in the base model (this is not done automatically).
If a data
argument is supplied, the snp.data
and
data
objects are aligned by rowname. Otherwise all variables in
the model formulae are assumed to be stored in the same order as the
columns of the snp.data
object.
Usually SNPs to be used in tests will be referenced by name. However,
they can
also be referenced by number, a positive number indicating the
appropriate column in the input snp.data
, and a negative number
indicating (minus) a position in the rules
list. They can also
be referenced by a logical selection vector of length equal to the
number of columns in snp.data
. Sets of tests
involving more than one SNP are referenced by a list and
can use a mixture of observed and imputed
SNPs. If the tests
argument is missing, single SNP tests are
carried out; if a rules
is given, all imputed SNP tests
are calculated, otherwise all SNPs in the input snp.data
matrix
are tested. But note that, for single SNP tests, the function
single.snp.tests
will often achieve the same
result much faster.
An object of class GlmTests
or GlmTestsScore
depending on whether score
is set to FALSE
or TRUE
in the call.
A factor (or
several factors) may be included as arguments to the function
strata(...)
in the formula
. This fits all
interactions of the factors so included, but leads to faster
computation than fitting these in the normal way. Additionally, a
cluster(...)
call may be included in the base model
formula. This identifies clusters of potentially correlated
observations (e.g. for members of the same family); in this case, an
appropriate robust estimate of the variance of the score test is used.
David Clayton [email protected]
Boos, Dennis D. (1992) On generalized score tests. The American Statistician, 46:327-333.
GlmTests-class
,
GlmTestsScore-class
,
single.snp.tests
, snp.lhs.tests
,
impute.snps
, ImputationRules-class
,
SnpMatrix-class
, XSnpMatrix-class
data(testdata) slt3 <- snp.rhs.tests(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, tests=1:10) print(slt3)
data(testdata) slt3 <- snp.rhs.tests(cc~strata(region), family="binomial", data=subject.data, snp.data= Autosomes, tests=1:10) print(slt3)
This class defines objects holding large arrays of single nucleotide polymorphism (SNP) genotypes generated using array technologies.
Objects can be created by calls of the form new("SnpMatrix", x)
where x
is a matrix with storage mode "raw"
.
Chips (usually corresponding to samples or
subjects) define rows of the matrix while polymorphisms (loci) define
columns. Rows and columns will usually have names which can be
used to link the data to further data concerning samples and SNPs
.Data
:Object of class "matrix"
and storage
mode raw
Internally, missing data are coded 0 and SNP genotypes are coded
1, 2 or 3. Imputed values may not be known exactly. Such
uncertain calls are grouped by probability and represented by codes
4 to 253
Class "matrix"
, from data part.
Class "structure"
, by class "matrix"
.
Class "array"
, by class "matrix"
.
Class "vector"
, by class "matrix", with explicit coerce.
Class "vector"
, by class "matrix", with explicit coerce.
signature(x = "SnpMatrix", i = "ANY", j = "ANY",
drop = "missing")
: subset
operations
signature(x = "SnpMatrix", y = "SnpMatrix")
:
S4 generic function to provide cbind() for two or
more matrices together by column. Row names must match and column
names must not coincide. If the matrices are of the derived class
XSnpMatrix-class
, the diploid
slot values must also
agree
signature(from = "SnpMatrix", to = "numeric")
:
map to numeric values 0, 1, 2 or, for uncertain assignments, to
the posterior expectation of the 0, 1, 2 code
signature(from = "SnpMatrix", to =
"character")
: map to codes "A/A", "A/B", "B/B", ""
signature(from = "matrix", to = "SnpMatrix")
:
maps numeric matrix (coded 0, 1, 2 or NA) to a SnpMatrix
signature(from = "SnpMatrix", to =
"XSnpMatrix")
:
maps a SnpMatrix to an XSnpMatrix. Ploidy is inferred from the
genotype data since haploid genotypes should always be coded as
homozygous.
After inferring ploidy, heterozygous calls for haploid genotypes
are set to NA
signature(x = "SnpMatrix")
: returns a logical
matrix indicating whether each element is NA
signature(x = "SnpMatrix", y = "snp.matrix")
:
S4 generic function to provide rbind() for two or
more matrices by row. Column names must match and duplicated row
names prompt warnings
signature(object = "SnpMatrix")
: shows the size
of the matrix (since most objects will be too large to show in full)
signature(object = "SnpMatrix")
: returns
summaries of the data frames returned by
row.summary
and col.summary
signature(x = "SnpMatrix")
: returns a logical
matrix of missing call indicators
signature(x = "SnpMatrix", snps
="ANY")
: Recode specified columns of of the matrix to reflect
allele switches
This class requires at least version 2.3 of R
David Clayton [email protected]
data(testdata) summary(Autosomes) # Just making it up - 3-10 will be made into NA during conversion snps.class<-new("SnpMatrix", matrix(1:10)) snps.class if(!isS4(snps.class)) stop("constructor is not working") pretend.X <- as(Autosomes, 'XSnpMatrix') if(!isS4(pretend.X)) stop("coersion to derived class is not S4") if(class(pretend.X) != 'XSnpMatrix') stop("coersion to derived class is not working") pretend.A <- as(Xchromosome, 'SnpMatrix') if(!isS4(pretend.A)) stop("coersion to base class is not S4") if(class(pretend.A) != 'SnpMatrix') stop("coersion to base class is not working") # display the first 10 snps of the first 10 samples print(as(Autosomes[1:10,1:10], 'character')) # convert the empty strings (no-calls) explicitly to "NC" before # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, gsub ('^$', 'NC', as(Autosomes[1:10,1:10], 'character') ), quote=FALSE) unlink(csvfile)
data(testdata) summary(Autosomes) # Just making it up - 3-10 will be made into NA during conversion snps.class<-new("SnpMatrix", matrix(1:10)) snps.class if(!isS4(snps.class)) stop("constructor is not working") pretend.X <- as(Autosomes, 'XSnpMatrix') if(!isS4(pretend.X)) stop("coersion to derived class is not S4") if(class(pretend.X) != 'XSnpMatrix') stop("coersion to derived class is not working") pretend.A <- as(Xchromosome, 'SnpMatrix') if(!isS4(pretend.A)) stop("coersion to base class is not S4") if(class(pretend.A) != 'SnpMatrix') stop("coersion to base class is not working") # display the first 10 snps of the first 10 samples print(as(Autosomes[1:10,1:10], 'character')) # convert the empty strings (no-calls) explicitly to "NC" before # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, gsub ('^$', 'NC', as(Autosomes[1:10,1:10], 'character') ), quote=FALSE) unlink(csvfile)
This is a generic function which can be applied to objects of class
"SnpMatrix"
or "XSnpMatrix"
(which hold SNP genotype
data), or to objects of class
"SingleSnpTestsScore"
or "GlmTests"
(which hold
association test results). In the former case, specified SNPs can be
recoded as if the alleles were switched (so that AA genotypes
become BB and
vice-versa while AB remain unchanged). In the latter case,
test results are modified as if alleles had been switched.
switch.alleles(x, snps)
switch.alleles(x, snps)
x |
The input object, of class |
snps |
A vector of type integer, character or logical specifying the SNP to have its alleles switched |
An object of the same class as the input object
Switching alleles for SNPs has no effect on test results. These functions are required when carrying out meta-analysis, bringing together several sets of results. It is then important that alleles line up in the datasets to be combined. It is often more convenient (and faster) to apply this process to the test result objects rather than to the genotype data themselves.
David Clayton [email protected]
SnpMatrix-class
,
XSnpMatrix-class
,
SingleSnpTests-class
,
GlmTests-class
data(testdata) which <- c("173774", "173811") Asw <- switch.alleles(Autosomes, which) col.summary(Autosomes[,which]) col.summary(Asw[,which])
data(testdata) which <- c("173774", "173811") Asw <- switch.alleles(Autosomes, which) col.summary(Autosomes[,which]) col.summary(Asw[,which])
Given large-scale SNP data for families comprising both parents and one or more affected offspring, this function computes 1 df tests (the TDT test) and a 2 df test based on observed and expected transmissions of genotypes. Tests based on imputation rules can also be carried out.
tdt.snp(ped, id, father, mother, affected, data = sys.parent(), snp.data, rules = NULL, snp.subset, check.inheritance = TRUE, robust = FALSE, uncertain = FALSE, score = FALSE)
tdt.snp(ped, id, father, mother, affected, data = sys.parent(), snp.data, rules = NULL, snp.subset, check.inheritance = TRUE, robust = FALSE, uncertain = FALSE, score = FALSE)
ped |
Pedigree identifiers |
id |
Subject identifiers |
father |
Identifiers for subjects' fathers |
mother |
Identifiers for subjects' mothers |
affected |
Disease status (TRUE if affected, FALSE otherwise) |
data |
A data frame in which to evaluate the previous five arguments |
snp.data |
An object of class |
rules |
An object of class
|
snp.subset |
A vector describing the subset of SNPs to be
considered. Default action is to test all SNPs in |
check.inheritance |
If TRUE, each affected offspring/parent trio is tested for Mendelian inheritance and excluded if the test fails. If FALSE, misinheriting trios are used but the "robust" variance option is forced |
robust |
If TRUE, forces the robust (Huber-White) variance option
(with |
uncertain |
If TRUE, uncertain genotypes are handed by replacing
score contributions by their posterior expectations. Otherwise these
are treated as missing. Setting this option authomatically invokes
use of |
score |
If |
Formally, the test statistics are score tests for the "conditioning on
parental genotype" (CPG) likelihood. Parametrization of associations is
the same as for the population-based tests calculated by
single.snp.tests
so that results from family-based and
population-based studies can be combined using pool
.
When the function is used to calculate tests for imputed SNPs, the test is still an approximate score test. The current version does not use the family relationships in the imputation. With this option, the robust variance estimate is forced.
The first five arguments are usually derived from a "pedfile". If a
data frame is supplied for the data
argument, the first five
arguments will be evaluated in this frame. Otherwise they will be evaluated
in the calling environment. If the arguments are missing, they will be
assumed to be in their usual positions in the pedfile data frame
i.e. in columns one to
four for the identifiers and column six for disease status
(with affected coded 2
). If the pedfile data are obtained from
a dataframe, the row names of the data
and snp.data
files will be used to align the pedfile and SNP data. Otherwise, these
vectors will be assumed to be in the same order as the rows of
snp.data
.
The snp.subset
argument can be a logical,
integer, or character vector.
If imputed rather than observed SNPs are tested, or
if check.inheritance
is set to
FALSE
, the "robust" variance estimate is used regardless of the
value supplied for the robust
argument.
An object of class
"SingleSnpTests"
.
If score=TRUE
, the output object will be of the extended class
"SingleSnpTestsScore"
containing additional slots holding the score statistics and their
variances (and covariances). This allows meta-analysis using the
pool
function.
When the snps are on the X chromosome (i.e. when the snp.data
argument is of class "XSnpMatrix"
), the tests are constructed
in the same way as was described by Clayton (2008) for population-based
association tests i.e. assuming that
genotype relative risks for males mirror thos of homozygous females
David Clayton [email protected]
Clayton (2008) Testing for association on the X chromosome Biostatistics, 9:593-600.)
single.snp.tests
, impute.snps
,
pool
, ImputationRules-class
,
SingleSnpTests-class
,
SingleSnpTestsScore-class
data(families) tdt.snp(data=pedData, snp.data=genotypes)
data(families) tdt.snp(data=pedData, snp.data=genotypes)
When testing genotype data derived from different platforms or scoring
algorithms a common problem is switching of alleles. This function
provides a diagnostic for this. Input can either be two objects of class
"SnpMatrix"
to be examined, column by column, for allele
switching, or a single "SnpMatrix"
object together with an
indicator vector giving group membership for its rows.
test.allele.switch(snps, snps2 = NULL, split = NULL, prior.df = 1)
test.allele.switch(snps, snps2 = NULL, split = NULL, prior.df = 1)
snps |
An object of class |
snps2 |
A second object of the same class as |
split |
If only one |
prior.df |
A degree of freedom parameter for the prior
distribution of the allele frequency |
This function calculates a Bayes factor for the comparison of the
hypothesis that the alleles have been switched with the hypothesis that
they have not been switched. This requires integration over the
posterior distribution of the allele frequency. The prior is taken as a
beta distribution with both parameters equal to prior.df
so that
the prior is symmetric about 0.5. The default,
prior.df=1
represents a uniform prior on (0,1).
A vector containing the log (base 10) of the Bayes Factors for an allele switch.
David Clayton [email protected]
SnpMatrix-class
, XSnpMatrix-class
data(testdata) # # Call with two SnpMatrix arguments # cc <- as.numeric(subject.data$cc) lbf1 <- test.allele.switch(Autosomes[cc==1,], Autosomes[cc==2,]) # # Single matrix call (giving the same result) # lbf2 <- test.allele.switch(Autosomes, split=cc)
data(testdata) # # Call with two SnpMatrix arguments # cc <- as.numeric(subject.data$cc) lbf1 <- test.allele.switch(Autosomes[cc==1,], Autosomes[cc==2,]) # # Single matrix call (giving the same result) # lbf2 <- test.allele.switch(Autosomes, split=cc)
This dataset comprises several data frames from a fictional (and unrealistically small) study. The dataset started off as real data from a screen of non-synonymous SNPs for association with type 1 diabetes, but the original identifiers have been removed and a random case/control status has been generated.
data(testdata)
data(testdata)
There are five data objects in the dataset:
Autosomes
: An object of class "SnpMatrix"
containing genotype calls for 400 subjects at 9445 autosomal SNPs
Xchromosome
: An object of class "XSnpMatrix"
containing genotype calls for 400 subjects at 155 SNPs on the X
chromosome
Asnps
: A dataframe containing information about the
autosomal SNPs. Here it contains only one variable,
chromosome
, indicating the chromosomes on which the SNPs are
located
Xsnps
: A dataframe containing information about the X
chromosome SNPs. Here it is empty and is only included for
completeness
subject.data
: A dataframe containing information about
the subjects from whom each row of SNP data was obtained. Here it
contains:
cc
: Case-control status
sex
: Sex
region
: Geographical region of residence
The data were obtained from the diabetes and inflammation laboratory (see http://www-gene.cimr.cam.ac.uk)
data(testdata) Autosomes Xchromosome summary(Asnps) summary(Xsnps) summary(subject.data) summary(summary(Autosomes)) summary(summary(Xchromosome))
data(testdata) Autosomes Xchromosome summary(Asnps) summary(Xsnps) summary(subject.data) summary(summary(Autosomes)) summary(summary(Xchromosome))
Given a SnpMatrix
object, together with associated subject and
SNP support dataframes, this function writes .bed
,
.bim
, and .fam
files for processing in the PLINK toolset
write.plink(file.base, snp.major = TRUE, snps, subject.data, pedigree, id, father, mother, sex, phenotype, snp.data, chromosome, genetic.distance, position, allele.1, allele.2, na.code = 0, human.genome=TRUE)
write.plink(file.base, snp.major = TRUE, snps, subject.data, pedigree, id, father, mother, sex, phenotype, snp.data, chromosome, genetic.distance, position, allele.1, allele.2, na.code = 0, human.genome=TRUE)
file.base |
A character string giving the base filename. The
extensions |
snp.major |
Logical variable controlling whether the |
snps |
The |
subject.data |
(Optional) A subject support dataframe. If
supplied, the next six arguments (which define the fields of the
PLINK |
pedigree |
A pedigree (family) identifier. Default is the row
names of |
id |
An identifier of an individual within family. Default is a
vector of |
father |
The within-family identifier of the subject's
father. Default is a vector of |
mother |
The within-family identifier of the subject's
mother. Default is a vector of |
sex |
Sex of the individual. Default is a vector of
|
phenotype |
The primary phenotype value. Default is a vector of
|
snp.data |
(Optional) A SNP support dataframe. If
supplied, the next five arguments (which define the columns of the
PLINK |
chromosome |
The chromosome on which the SNP is located. This
should either be numeric, as specified by SPLINK, or character, with
|
genetic.distance |
The location of the SNP, expressed as a
genetic distance. Default is a vector of |
position |
The physical location of the SNP, expressed in base pairs.
Default is a vector of |
allele.1 |
A character vector giving the first allele. Default is
a vector of |
allele.2 |
A character vector giving the first allele. Default is
a vector of |
na.code |
The code to be written for |
human.genome |
If TRUE, check the
|
For more details of required codings in .fam
and .bim
files, see the PLINK documentation.
Returns NULL
.
David Clayton [email protected]
PLINK: Whole genome association analysis toolset. http://pngu.mgh.harvard.edu/~purcell/plink/
read.plink
,
SnpMatrix-class
, XSnpMatrix-class
data(testdata) ## the use of as.numeric() below is not necessary since this is done ## automatically. It just illustrates use of expressions for these arguments ## Note that cc and sex are variables within the subject.data frame ## This command writes files test.bed. test.fam and test.bim write.plink(file.base="test", snps=Autosomes, subject.data=subject.data, phenotype = as.numeric(cc), sex=as.numeric(sex), snp.major=FALSE)
data(testdata) ## the use of as.numeric() below is not necessary since this is done ## automatically. It just illustrates use of expressions for these arguments ## Note that cc and sex are variables within the subject.data frame ## This command writes files test.bed. test.fam and test.bim write.plink(file.base="test", snps=Autosomes, subject.data=subject.data, phenotype = as.numeric(cc), sex=as.numeric(sex), snp.major=FALSE)
This function is closely modelled on write.table
. It writes an
object of class SnpMatrix
as a text file with one line for
each row of the matrix. Genotpyes are written in numerical form,
i.e. as 0, 1 or 2 (where 1 denotes heterozygous) or,
optionally, as a pair of alleles (each coded 1 or 2).
write.SnpMatrix(x, file, as.alleles= FALSE, append = FALSE, quote = TRUE, sep = " ", eol = "\n", na = "NA", row.names = TRUE, col.names = TRUE)
write.SnpMatrix(x, file, as.alleles= FALSE, append = FALSE, quote = TRUE, sep = " ", eol = "\n", na = "NA", row.names = TRUE, col.names = TRUE)
x |
The object to be written |
file |
The name of the output file |
as.alleles |
If |
append |
If |
quote |
If |
sep |
The string separating entries within a line |
eol |
The string terminating each line |
na |
The string written for missing genotypes |
row.names |
If |
col.names |
If |
A numeric vector giving the dimensions of the matrix written
David Clayton [email protected]
write.table
, SnpMatrix-class
,
XSnpMatrix-class
This class extends the SnpMatrix-class
to
deal with SNPs on the X and Y chromosomes and mitocondrial SNPs.
Objects can be created by calls of the form new("XSnpMatrix", x,
diploid)
.
Such objects have an additional slot
to objects of class
"SnpMatrix"
consisting of a logical array of the same length as the number of
rows. This array indicates whether genotypes in that row are diploid
(TRUE
) or haploid (FALSE
as, for example, SNPs on the X
chromosome for males).
.Data
:Object of class "matrix"
and storage mode
"raw"
diploid
:Object of class "logical"
indicating
sex of samples
Class "SnpMatrix"
, directly, with explicit coerce.
Class "matrix"
, by class "SnpMatrix"
.
Class "structure"
, by class "SnpMatrix"
.
Class "array"
, by class "SnpMatrix"
.
Class "vector"
, by class "SnpMatrix", with explicit coerce.
Class "vector"
, by class "SnpMatrix", with explicit coerce.
signature(x = "XSnpMatrix", i = "ANY", j = "ANY",
drop = "missing")
: subset extraction
signature(x = "XSnpMatrix", i = "ANY", j = "ANY", "XSnpMatrix")
: subset
assignment operation to replace part of an object
signature(from = "XSnpMatrix", to =
"character")
: map to codes 0, 1, 2, or NA
signature(from = "SnpMatrix", to =
"XSnpMatrix")
:
maps a SnpMatrix to an XSnpMatrix. Ploidy is inferred from the
genotype data since haploid genotypes should always be coded as
homozygous. After inferring ploidy, heterozygous calls for haploid
genotpes are set to NA
signature(object = "XSnpMatrix")
: map to codes
"A/A", "A/B", "B/B", "A", "B" or ""
signature(object = "XSnpMatrix")
: returns
the distribution of ploidy, together with
summaries of the data frames returned by
row.summary
and col.summary
David Clayton [email protected]
data(testdata) summary(Xchromosome) # display the first 10 snps of the first 10 samples print(as(Xchromosome[1:10,1:10],'character')) # convert the empty strings (no-calls) explicitly to "NC" before # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, gsub ('^$', 'NC', as(Xchromosome[1:10,1:10], 'character') ), quote=FALSE) unlink(csvfile)
data(testdata) summary(Xchromosome) # display the first 10 snps of the first 10 samples print(as(Xchromosome[1:10,1:10],'character')) # convert the empty strings (no-calls) explicitly to "NC" before # writing to an (anonymous and temporary) csv file csvfile <- tempfile() write.csv(file=csvfile, gsub ('^$', 'NC', as(Xchromosome[1:10,1:10], 'character') ), quote=FALSE) unlink(csvfile)
The input SnpMatrix is first standardized by subtracting the mean (or stratum mean) from each call and dividing by the expected standard deviation under Hardy-Weinberg equilibrium. It is then post-multiplied by its transpose. This is a preliminary step in the computation of principal components.
xxt(snps, strata = NULL, correct.for.missing = FALSE, lower.only = FALSE, uncertain = FALSE)
xxt(snps, strata = NULL, correct.for.missing = FALSE, lower.only = FALSE, uncertain = FALSE)
snps |
The input matrix, of type |
strata |
A |
correct.for.missing |
If |
lower.only |
If |
uncertain |
If |
This computation forms the first step of the calculation of principal
components for genome-wide SNP data. As pointed out by Price et al.
(2006), when the data matrix has more rows than columns it is
most efficient to calculate the eigenvectors of
X.X-transpose, where X is a
SnpMatrix
whose columns have been standardized to zero mean and
unit variance. For autosomes, the genotypes are given codes 0, 1 or 2
after subtraction of the mean, 2p, are divided by the standard
deviation
sqrt(2p(1-p)) (p is the estimated allele
frequency). For SNPs on the X chromosome in male subjects,
genotypes are coded 0 or 2. Then
the mean is still 2p, but the standard deviation is
2sqrt(p(1-p)). If the strata
is supplied, a
stratum-specific estimate value for p is used for
standardization.
Missing observations present some difficulty. Price et al. (2006)
recommended replacing missing observations by their means, this being
equivalent to replacement by zeros in the standardized matrix. However
this results in a biased estimate of the complete data
result. Optionally this bias can be corrected by inverse probability
weighting. We assume that the probability that any one call is missing
is small, and can be predicted by a multiplicative model with row
(subject) and column (locus) effects. The estimated probability of a
missing value in a given row and column is then given by
, where R is the row total number of
no-calls, C is the column total of no-calls, and T is the
overall total number of no-calls. Non-missing contributions to
X.X-transpose are then weighted by
for
contributions to the diagonal elements, and products of the relevant
pairs of weights for contributions to off–diagonal elements.
A square matrix containing either the complete X.X-transpose matrix, or just its lower triangle
The correction for missing observations can result in an output matrix which is not positive semi-definite. This should not matter in the application for which it is intended
In genome-wide studies, the SNP data will usually be held as a series of
objects (of
class "SnpMatrix"
or"XSnpMatrix"
), one per chromosome.
Note that the X.X-transpose matrices
produced by applying the xxt
function to each object in turn
can be added to yield the genome-wide result.
If the matrix is converted to a correlation matrix by pre- and post-multiplying by the sqrt of the inverse of its diagonal, then this is an unbiased estimate of twice the kinship matrix.
David Clayton [email protected]
Price et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38:904-9
# make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:100,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=TRUE) # Calculate the principal components pc <- eigen(xx, symmetric=TRUE)$vectors
# make a SnpMatrix with a small number of rows data(testdata) small <- Autosomes[1:100,] # Calculate the X.X-transpose matrix xx <- xxt(small, correct.for.missing=TRUE) # Calculate the principal components pc <- eigen(xx, symmetric=TRUE)$vectors