Title: | Logistic Factor Analysis for Categorical Data |
---|---|
Description: | Logistic Factor Analysis is a method for a PCA analogue on Binomial data via estimation of latent structure in the natural parameter. The main method estimates genetic population structure from genotype data. There are also methods for estimating individual-specific allele frequencies using the population structure. Lastly, a structured Hardy-Weinberg equilibrium (HWE) test is developed, which quantifies the goodness of fit of the genotype data to the estimated population structure, via the estimated individual-specific allele frequencies (all of which generalizes traditional HWE tests). |
Authors: | Wei Hao [aut], Minsun Song [aut], Alejandro Ochoa [aut, cre] , John D. Storey [aut] |
Maintainer: | Alejandro Ochoa <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.7.0 |
Built: | 2024-10-30 08:35:51 UTC |
Source: | https://github.com/bioc/lfa |
Compute matrix of individual-specific allele frequencies
af(X, LF, safety = FALSE, max_iter = 100, tol = 1e-10)
af(X, LF, safety = FALSE, max_iter = 100, tol = 1e-10)
X |
A matrix of SNP genotypes, i.e. an integer matrix of 0's,
1's, 2's and |
LF |
Matrix of logistic factors, with intercept.
Pass in the return value from |
safety |
Optional boolean to bypass checks on the genotype
matrices, which require a non-trivial amount of computation.
Ignored if |
max_iter |
Maximum number of iterations for logistic regression |
tol |
Numerical tolerance for convergence of logistic regression |
Computes the matrix of individual-specific allele
frequencies, which has the same dimensions of the genotype matrix.
Be warned that this function could use a ton of memory, as the
return value is all doubles. It could be wise to pass only a
selection of the SNPs in your genotype matrix to get an idea for
memory usage. Use gc()
to check memory usage!
Matrix of individual-specific allele frequencies.
LF <- lfa( hgdp_subset, 4 ) allele_freqs <- af( hgdp_subset, LF )
LF <- lfa( hgdp_subset, 4 ) allele_freqs <- af( hgdp_subset, LF )
Computes individual-specific allele frequencies for a single SNP.
af_snp(snp, LF, max_iter = 100, tol = 1e-10)
af_snp(snp, LF, max_iter = 100, tol = 1e-10)
snp |
vector of 0's, 1's, and 2's |
LF |
Matrix of logistic factors, with intercept.
Pass in the return value from |
max_iter |
Maximum number of iterations for logistic regression |
tol |
Numerical tolerance for convergence of logistic regression |
vector of allele frequencies
LF <- lfa(hgdp_subset, 4) # pick one SNP only snp <- hgdp_subset[ 1, ] # allele frequency vector for that SNO only allele_freqs_snp <- af_snp(snp, LF)
LF <- lfa(hgdp_subset, 4) # pick one SNP only snp <- hgdp_subset[ 1, ] # allele frequency vector for that SNO only allele_freqs_snp <- af_snp(snp, LF)
C routine to row-center and scale a matrix. Doesn't work with missing data.
centerscale(A)
centerscale(A)
A |
matrix |
matrix same dimensions A
but row centered and scaled
Xc <- centerscale(hgdp_subset)
Xc <- centerscale(hgdp_subset)
Subset of the HGDP dataset.
hgdp_subset
hgdp_subset
a matrix of 0's, 1's and 2's.
genotype matrix
Stanford HGDP http://www.hagsc.org/hgdp/files.html
Fit logistic factor model of dimension d
to binomial data.
Computes d - 1
singular vectors followed by intercept.
lfa( X, d, adjustments = NULL, override = FALSE, safety = FALSE, rspectra = FALSE, ploidy = 2, tol = .Machine$double.eps, m_chunk = 1000 )
lfa( X, d, adjustments = NULL, override = FALSE, safety = FALSE, rspectra = FALSE, ploidy = 2, tol = .Machine$double.eps, m_chunk = 1000 )
X |
A matrix of SNP genotypes, i.e. an integer matrix of 0's,
1's, 2's and |
d |
Number of logistic factors, including the intercept |
adjustments |
A matrix of adjustment variables to hold fixed during
estimation. Number of rows must equal number of individuals in |
override |
Optional boolean passed to |
safety |
Optional boolean to bypass checks on the genotype
matrices, which require a non-trivial amount of computation.
Ignored if |
rspectra |
If |
ploidy |
Ploidy of data, defaults to 2 for bi-allelic unphased SNPs |
tol |
Tolerance value passed to |
m_chunk |
If |
Genotype matrix should have values in 0, 1, 2, or NA
.
The coding of the SNPs (which case is 0 vs 2) does not change the output.
The matrix of logistic factors, with individuals along rows and factors along columns. The intercept appears at the end of the columns, and adjustments in the beginning if present.
LF <- lfa(hgdp_subset, 4) dim(LF) head(LF)
LF <- lfa(hgdp_subset, 4) dim(LF) head(LF)
Compute matrix of individual-specific allele frequencies via PCA
pca_af(X, d, override = FALSE, ploidy = 2, tol = 1e-13, m_chunk = 1000)
pca_af(X, d, override = FALSE, ploidy = 2, tol = 1e-13, m_chunk = 1000)
X |
A matrix of SNP genotypes, i.e. an integer matrix of 0's,
1's, 2's and |
d |
Number of logistic factors, including the intercept |
override |
Optional boolean passed to |
ploidy |
Ploidy of data, defaults to 2 for bi-allelic unphased SNPs |
tol |
Tolerance value passed to |
m_chunk |
If |
This corresponds to algorithm 1 in the paper. Only used for comparison purposes.
Matrix of individual-specific allele frequencies.
LF <- lfa(hgdp_subset, 4) allele_freqs_lfa <- af(hgdp_subset, LF) allele_freqs_pca <- pca_af(hgdp_subset, 4) summary(abs(allele_freqs_lfa-allele_freqs_pca))
LF <- lfa(hgdp_subset, 4) allele_freqs_lfa <- af(hgdp_subset, LF) allele_freqs_pca <- pca_af(hgdp_subset, 4) summary(abs(allele_freqs_lfa-allele_freqs_pca))
Compute structural Hardy-Weinberg Equilibrium (sHWE) p-values
on a SNP-by-SNP basis. These p-values can be aggregated to
determine genome-wide goodness-of-fit for a particular value
of d
. See doi:10.1101/240804 for more
details.
sHWE(X, LF, B, max_iter = 100, tol = 1e-10)
sHWE(X, LF, B, max_iter = 100, tol = 1e-10)
X |
A matrix of SNP genotypes, i.e. an integer matrix of 0's,
1's, 2's and |
LF |
matrix of logistic factors |
B |
number of null datasets to generate, |
max_iter |
Maximum number of iterations for logistic regression |
tol |
Tolerance value passed to |
a vector of p-values for each SNP.
# get LFs LF <- lfa(hgdp_subset, 4) # look at a small (300) number of SNPs for rest of this example: hgdp_subset_small <- hgdp_subset[ 1:300, ] gof_4 <- sHWE(hgdp_subset_small, LF, 3) LF <- lfa(hgdp_subset, 10) gof_10 <- sHWE(hgdp_subset_small, LF, 3) hist(gof_4) hist(gof_10)
# get LFs LF <- lfa(hgdp_subset, 4) # look at a small (300) number of SNPs for rest of this example: hgdp_subset_small <- hgdp_subset[ 1:300, ] gof_4 <- sHWE(hgdp_subset_small, LF, 3) LF <- lfa(hgdp_subset, 10) gof_10 <- sHWE(hgdp_subset_small, LF, 3) hist(gof_4) hist(gof_10)
Truncated SVD
trunc_svd( A, d, adjust = 3, tol = .Machine$double.eps, override = FALSE, force = FALSE, maxit = 1000 )
trunc_svd( A, d, adjust = 3, tol = .Machine$double.eps, override = FALSE, force = FALSE, maxit = 1000 )
A |
matrix to decompose |
d |
number of singular vectors |
adjust |
extra singular vectors to calculate for accuracy |
tol |
convergence criterion |
override |
|
force |
If |
maxit |
Maximum number of iterations |
Performs singular value decomposition but only returns the first d
singular vectors/values.
The truncated SVD utilizes Lanczos bidiagonalization.
See references.
This function was modified from the package irlba 1.0.1 under GPL.
Replacing the crossprod()
calls with the C wrapper to
dgemv
is a dramatic difference in larger datasets.
Since the wrapper is technically not a matrix multiplication function, it
seemed wise to make a copy of the function.
list with singular value decomposition. Has elements 'd', 'u', 'v', and 'iter'
obj <- trunc_svd( hgdp_subset, 4 ) obj$d obj$u obj$v obj$iter
obj <- trunc_svd( hgdp_subset, 4 ) obj$d obj$u obj$v obj$iter