The BG2
package provides statistical tools for the
analysis of Poisson and Binary GWAS data. Currently, BG2
contains functions to perform BG2 which is a novel two step Bayesian
procedure that, when compared to single marker association testing
(SMA), drastically reduces the rate of false discoveries while
maintaining the same level of recall of true causal SNPs. Full details
on the BG2 procedure can be found in the manuscript (Xu, Williams, and Ferreira 202X). The two
packages in Bioconductor are GWASTools (Gogarten
et al. 2012) and GWAS.BAYES (Williams,
Ferreira, and Ji 2022). While GWASTools provides frequentist
methods, BG2 provides Bayesian methods. While GWAS.Bayes provides
methods for analysis of Gaussian data, BG2 provides methods for the
analysis of non-Gaussian data.
This vignette explores two toy examples as well as a case study
presented in the BG2 manuscript (Xu, Williams,
and Ferreira 202X) to illustrate how the functions provided in
BG2
perform the BG2 procedure. Data has been simulated
under a generalized linear mixed model from 9,000 SNPs of 328 A.
Thaliana ecotypes. The BG2
package includes as
R
objects the simulated data; 9,000 SNPs, the simulated
phenotypes (both binary and Poisson), and the kinship matrix used to
simulate the data. Further, the Github repo that contains the
BG2
package also contains contains the data for the A.
Thaliana case study.
The function implemented in BG2
is described below:
BG2
Performs BG2, as described in the BG2 manuscript
(Xu, Williams, and Ferreira 202X), using
generalized linear mixed models for a given numeric phenotype vector
either binary or assumed Poisson distributed Y
, a SNP
matrix encoded numerically SNPs
, fixed covariates
Fixed
, and random effects and their projection matrices
(covariance
and Z
respectively). The
BG2
function returns the indices of the SNP matrix that
were identified in the best model found by the BG2 procedure.The model for GWAS analysis used in the BG2
package
is
where
Currently, BG2
can analyze binary responses
(family = "bernoulli"
) and Poisson responses
(family = "poisson"
). The BG2 manuscript (Xu, Williams, and Ferreira 202X) provides full
details on the priors for β. BG2
utilizes
spectral decomposition techniques similar to that of Kang et al. (2008) as well as population
parameters previously determined (Zhang et al.
2010) to speed up computation.
The BG2
function requires a vector of observed
phenotypes (either binary or assumed Poisson distributed), a matrix of
SNPs, and the specification of the random effects. First, the vector of
observed phenotypes must be a numeric vector or a numeric n × 1 matrix. BG2
does
not allow the analysis of multiple phenotypes simultaneously. In the
BG2
package, there are two simulated phenotype vectors. The
first simulated phenotype vector comes from a Poisson generalized linear
mixed model with both a kinship random effect and an overdispersion
random effect. The data is assumed to have four replicates for each
A. Thaliana ecotype. Here are the first five elements of the
Poisson simulated vector of phenotypes:
The second simulated phenotype vector comes from a binary generalized linear mixed model with only a kinship random effect. The first five elements of the binary simulated vector of phenotypes are
Second, the SNP matrix has to contain numeric values where each column corresponds to a SNP of interest and the ith row corresponds to the ith observed phenotype. In this example, the SNPs are a subset of the A. Thaliana TAIR9 genotype dataset and all SNPs have minor allele frequency greater than 0.01. Each simulated phenotype vector is simulated using this SNP matrix. Here are the first five rows and five columns of the SNP matrix:
data("SNPs")
SNPs[1:5,1:5]
#> SNP2555 SNP2556 SNP2557 SNP2558 SNP2559
#> [1,] 1 1 1 0 0
#> [2,] 0 1 1 1 1
#> [3,] 0 0 1 1 1
#> [4,] 1 1 0 0 1
#> [5,] 1 1 1 1 1
Third, the kinship matrix is an n × n positive semi-definite matrix containing only numeric values. The ith row or ith column quantifies how observation i is related to other observations. Since, both simulated phenotypes are simulated from the same SNP matrix they also are simulated from the same kinship structure. The first five rows and five columns of the kinship matrix are
data("kinship")
kinship[1:5,1:5]
#> V1 V2 V3 V4 V5
#> [1,] 0.78515873 0.15800700 0.04264546 0.02057071 0.05643574
#> [2,] 0.15800700 0.78146476 0.05135891 0.01476357 0.05482448
#> [3,] 0.04264546 0.05135891 0.80199976 0.10558970 0.04888596
#> [4,] 0.02057071 0.01476357 0.10558970 0.80030413 0.02935703
#> [5,] 0.05643574 0.05482448 0.04888596 0.02935703 0.78401489
The function BG2
implements the BG2 method for
generalized linear mixed models with either Poisson or Bernoulli
distributed responses. This function takes inputs of the observed
phenotypes, the SNPs coded numerically, the distributional family the
response follows, fixed covariates treated as a matrix, the covariance
matrices of the random effects, the design matrices of the random
effects, the number of replicates of individuals or ecotypes you may
have, and the choice of a fixed value or a prior for the dispersion
parameter of the nonlocal prior. Further, the other inputs of
BG2
are the FDR nominal level, the maximum number of
iterations of the genetic algorithm in the model selection step, and the
number of consecutive iterations of the genetic algorithm with the same
best model for convergence. The full details of BG2 are available in the
BG2 manuscript (Xu, Williams, and Ferreira
202X). The default values of maximum iterations and the number of
iterations are the values used in the simulation study in the BG paper,
that is, 4000 and 400 respectively.
Here we illustrate the use of BG2 with a nominal FDR of 0.05 with
Poisson count data. First we specify the covariance matrices for the
random effects. The first random effect is assumed to be α1 ∼ N(0, κ1K),
where K is the realized
relationship matrix or kinship matrix. The second random effect is
assumed to be α1 ∼ N(0, κ2I),
where the covariance matrix is an identity matrix times a scalar. This
second random effect is to account for overdispersion in the Poisson
model. The Covariance
argument takes a list of random
effect covariance matrices. For this example, the list of covariance
matrices is set as:
n <- length(Y_poisson)
covariance <- list()
covariance[[1]] <- kinship
covariance[[2]] <- diag(1, nrow = n, ncol = n)
The design matrices Zi do not need
to be specified in Z
as the observations have no other
structure such as a grouping structure. Z
is set to be NULL
implying that Zi = In × n.
Further, the number of ecotype replications is 4. Finally, we let the
dispersion parameter of the nonlocal prior have a uniform prior.
set.seed(1330)
output_poisson <- BG2(Y=Y_poisson, SNPs=SNPs, Fixed = NULL,
Covariance=covariance, Z=NULL, family="poisson",
replicates=4, Tau="uniform",FDR_Nominal = 0.05,
maxiterations = 4000, runs_til_stop = 400)
output_poisson
#> [1] 385 450 462 463 1261 1328 1350 1437 2250 3150 4050 7630
BG2
outputs the column indices of the SNPs
matrix that are in best model or column indices of SNPs perfectly
correlated to SNPs in the best model. The data was generated with causal
SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750,
7,650,and 8,550. BG2 identifies 5 causal SNPs.
Here we illustrate the use of BG2 with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The only random effect is assumed to be α ∼ N(0, κ1K), where K is the realized relationship matrix or kinship matrix. For this example, the list of covariance matrices is set as:
In this example, the design matrices Zi do not need
to be specified in Z
as the observations have no other
structure such as a grouping structure. Z
is set to be NULL
implying that Zi = In × n.
With binary data, setting the number of replicates provides no
computation gain and is not required. Finally, we let the dispersion
parameter of the nonlocal prior have a Inverse Gamma distribution.
Details on the Inverse Gamma Distribution are provide in the BG2
manuscript (Xu, Williams, and Ferreira
202X).
set.seed(1330)
output_binary <- BG2(Y=Y_binary, SNPs=SNPs, Fixed = NULL,
Covariance=covariance, Z=NULL, family="bernoulli",
replicates=NULL, Tau="IG",FDR_Nominal = 0.05,
maxiterations = 4000, runs_til_stop = 400)
output_binary
#> [1] 450 2250 7650 8550
Similarly to the Poisson example in Section 4.1.1, the data was generated with causal SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750, 7,650,and 8,550. BG2 identifies 4 causal SNPs and no false SNPs.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BG2_1.7.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
#> [4] ggplot2_3.5.1 recipes_1.1.0 lattice_0.22-6
#> [7] vctrs_0.6.5 tools_4.4.2 generics_0.1.3
#> [10] stats4_4.4.2 parallel_4.4.2 tibble_3.2.1
#> [13] pkgconfig_2.0.3 ModelMetrics_1.2.2.2 Matrix_1.7-1
#> [16] data.table_1.16.4 lifecycle_1.0.4 compiler_4.4.2
#> [19] stringr_1.5.1 munsell_0.5.1 codetools_0.2-20
#> [22] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
#> [25] class_7.3-23 sass_0.4.9 yaml_2.3.10
#> [28] prodlim_2024.06.25 crayon_1.5.3 pillar_1.10.0
#> [31] jquerylib_0.1.4 MASS_7.3-63 cachem_1.1.0
#> [34] gower_1.0.2 iterators_1.0.14 rpart_4.1.23
#> [37] foreach_1.5.2 nlme_3.1-166 parallelly_1.41.0
#> [40] lava_1.8.0 tidyselect_1.2.1 digest_0.6.37
#> [43] stringi_1.8.4 future_1.34.0 dplyr_1.1.4
#> [46] reshape2_1.4.4 purrr_1.0.2 listenv_0.9.1
#> [49] maketools_1.3.1 splines_4.4.2 fastmap_1.2.0
#> [52] grid_4.4.2 colorspace_2.1-1 cli_3.6.3
#> [55] magrittr_2.0.3 survival_3.8-3 future.apply_1.11.3
#> [58] withr_3.0.2 scales_1.3.0 lubridate_1.9.4
#> [61] timechange_0.3.0 rmarkdown_2.29 globals_0.16.3
#> [64] nnet_7.3-20 timeDate_4041.110 GA_3.2.4
#> [67] memoise_2.0.1 evaluate_1.0.1 knitr_1.49
#> [70] hardhat_1.4.0 caret_7.0-1 rlang_1.1.4
#> [73] Rcpp_1.0.13-1 glue_1.8.0 BiocManager_1.30.25
#> [76] pROC_1.18.5 ipred_0.9-15 jsonlite_1.8.9
#> [79] R6_2.5.1 plyr_1.8.9