library(BG2)

Introduction

The BG2 package provides statistical tools for the analysis of Poisson and Binary GWAS data. Currently, BG2 contains functions to perform BG2 which is a novel two step Bayesian procedure that, when compared to single marker association testing (SMA), drastically reduces the rate of false discoveries while maintaining the same level of recall of true causal SNPs. Full details on the BG2 procedure can be found in the manuscript (Xu, Williams, and Ferreira 202X). The two packages in Bioconductor are GWASTools (Gogarten et al. 2012) and GWAS.BAYES (Williams, Ferreira, and Ji 2022). While GWASTools provides frequentist methods, BG2 provides Bayesian methods. While GWAS.Bayes provides methods for analysis of Gaussian data, BG2 provides methods for the analysis of non-Gaussian data.

This vignette explores two toy examples as well as a case study presented in the BG2 manuscript (Xu, Williams, and Ferreira 202X) to illustrate how the functions provided in BG2 perform the BG2 procedure. Data has been simulated under a generalized linear mixed model from 9,000 SNPs of 328 A. Thaliana ecotypes. The BG2 package includes as R objects the simulated data; 9,000 SNPs, the simulated phenotypes (both binary and Poisson), and the kinship matrix used to simulate the data. Further, the Github repo that contains the BG2 package also contains contains the data for the A. Thaliana case study.

Functions

The function implemented in BG2 is described below:

BG2 Performs BG2, as described in the BG2 manuscript (Xu, Williams, and Ferreira 202X), using generalized linear mixed models for a given numeric phenotype vector either binary or assumed Poisson distributed Y, a SNP matrix encoded numerically SNPs, fixed covariates Fixed, and random effects and their projection matrices (covariance and Z respectively). The BG2 function returns the indices of the SNP matrix that were identified in the best model found by the BG2 procedure.

Model/Model Assumptions

The model for GWAS analysis used in the BG2 package is

where

g() is the link function.
y is the vector of phenotype responses. Either binary or Poisson.
X is the matrix of SNPs (single nucleotide polymorphisms).
β is the vector of regression coefficients that contains the effects of the SNPs.
X_f is a matrix of fixed covariates.
β_f is the vector of regression coefficients that contains the effects of the fixed covariates.
Z_i is an incidence matrix relating the random effects α_i to the phenotype response.
α_i is a vector of random effects with covariance matrix Σ_i. Common covariance structures include the identity matrix and kinship matrix.

Currently, BG2 can analyze binary responses (family = "bernoulli") and Poisson responses (family = "poisson"). The BG2 manuscript (Xu, Williams, and Ferreira 202X) provides full details on the priors for β. BG2 utilizes spectral decomposition techniques similar to that of Kang et al. (2008) as well as population parameters previously determined (Zhang et al. 2010) to speed up computation.

Examples

The BG2 function requires a vector of observed phenotypes (either binary or assumed Poisson distributed), a matrix of SNPs, and the specification of the random effects. First, the vector of observed phenotypes must be a numeric vector or a numeric n × 1 matrix. BG2 does not allow the analysis of multiple phenotypes simultaneously. In the BG2 package, there are two simulated phenotype vectors. The first simulated phenotype vector comes from a Poisson generalized linear mixed model with both a kinship random effect and an overdispersion random effect. The data is assumed to have four replicates for each A. Thaliana ecotype. Here are the first five elements of the Poisson simulated vector of phenotypes:

data("Y_poisson")
Y_poisson[1:5]
#> [1]  74  10 215   1   0

The second simulated phenotype vector comes from a binary generalized linear mixed model with only a kinship random effect. The first five elements of the binary simulated vector of phenotypes are

data("Y_binary")
Y_binary[1:5]
#> [1] 0 0 1 1 0

Second, the SNP matrix has to contain numeric values where each column corresponds to a SNP of interest and the ith row corresponds to the ith observed phenotype. In this example, the SNPs are a subset of the A. Thaliana TAIR9 genotype dataset and all SNPs have minor allele frequency greater than 0.01. Each simulated phenotype vector is simulated using this SNP matrix. Here are the first five rows and five columns of the SNP matrix:

data("SNPs")
SNPs[1:5,1:5]
#>      SNP2555 SNP2556 SNP2557 SNP2558 SNP2559
#> [1,]       1       1       1       0       0
#> [2,]       0       1       1       1       1
#> [3,]       0       0       1       1       1
#> [4,]       1       1       0       0       1
#> [5,]       1       1       1       1       1

Third, the kinship matrix is an n × n positive semi-definite matrix containing only numeric values. The ith row or ith column quantifies how observation i is related to other observations. Since, both simulated phenotypes are simulated from the same SNP matrix they also are simulated from the same kinship structure. The first five rows and five columns of the kinship matrix are

data("kinship")
kinship[1:5,1:5]
#>              V1         V2         V3         V4         V5
#> [1,] 0.78515873 0.15800700 0.04264546 0.02057071 0.05643574
#> [2,] 0.15800700 0.78146476 0.05135891 0.01476357 0.05482448
#> [3,] 0.04264546 0.05135891 0.80199976 0.10558970 0.04888596
#> [4,] 0.02057071 0.01476357 0.10558970 0.80030413 0.02935703
#> [5,] 0.05643574 0.05482448 0.04888596 0.02935703 0.78401489

Simulated Data

The function BG2 implements the BG2 method for generalized linear mixed models with either Poisson or Bernoulli distributed responses. This function takes inputs of the observed phenotypes, the SNPs coded numerically, the distributional family the response follows, fixed covariates treated as a matrix, the covariance matrices of the random effects, the design matrices of the random effects, the number of replicates of individuals or ecotypes you may have, and the choice of a fixed value or a prior for the dispersion parameter of the nonlocal prior. Further, the other inputs of BG2 are the FDR nominal level, the maximum number of iterations of the genetic algorithm in the model selection step, and the number of consecutive iterations of the genetic algorithm with the same best model for convergence. The full details of BG2 are available in the BG2 manuscript (Xu, Williams, and Ferreira 202X). The default values of maximum iterations and the number of iterations are the values used in the simulation study in the BG paper, that is, 4000 and 400 respectively.

BG2 Poisson Example

Here we illustrate the use of BG2 with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The first random effect is assumed to be α₁ ∼ N(0, κ₁K), where K is the realized relationship matrix or kinship matrix. The second random effect is assumed to be α₁ ∼ N(0, κ₂I), where the covariance matrix is an identity matrix times a scalar. This second random effect is to account for overdispersion in the Poisson model. The Covariance argument takes a list of random effect covariance matrices. For this example, the list of covariance matrices is set as:

n <- length(Y_poisson)
covariance <- list()
covariance[[1]] <- kinship
covariance[[2]] <- diag(1, nrow = n, ncol = n)

The design matrices Z_i do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to be NULL implying that Z_i = I_n × n. Further, the number of ecotype replications is 4. Finally, we let the dispersion parameter of the nonlocal prior have a uniform prior.

set.seed(1330)
output_poisson <- BG2(Y=Y_poisson, SNPs=SNPs, Fixed = NULL, 
                      Covariance=covariance, Z=NULL, family="poisson", 
                      replicates=4, Tau="uniform",FDR_Nominal = 0.05, 
                      maxiterations = 4000, runs_til_stop = 400)
output_poisson
#>  [1]  385  450  462  463 1261 1328 1350 1437 2250 3150 4050 7630

BG2 outputs the column indices of the SNPs matrix that are in best model or column indices of SNPs perfectly correlated to SNPs in the best model. The data was generated with causal SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750, 7,650,and 8,550. BG2 identifies 5 causal SNPs.

BG2 Binary Example

Here we illustrate the use of BG2 with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The only random effect is assumed to be α ∼ N(0, κ₁K), where K is the realized relationship matrix or kinship matrix. For this example, the list of covariance matrices is set as:

covariance <- list()
covariance[[1]] <- kinship

In this example, the design matrices Z_i do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to be NULL implying that Z_i = I_n × n. With binary data, setting the number of replicates provides no computation gain and is not required. Finally, we let the dispersion parameter of the nonlocal prior have a Inverse Gamma distribution. Details on the Inverse Gamma Distribution are provide in the BG2 manuscript (Xu, Williams, and Ferreira 202X).

set.seed(1330)
output_binary <- BG2(Y=Y_binary, SNPs=SNPs, Fixed = NULL, 
                     Covariance=covariance, Z=NULL, family="bernoulli", 
                     replicates=NULL, Tau="IG",FDR_Nominal = 0.05, 
                     maxiterations = 4000, runs_til_stop = 400)
output_binary
#> [1]  450 2250 7650 8550

Similarly to the Poisson example in Section 4.1.1, the data was generated with causal SNPs at positions 450, 1,350, 2,250, 3,150, 4,050, 4,950, 5,850, 6,750, 7,650,and 8,550. BG2 identifies 4 causal SNPs and no false SNPs.

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] BG2_1.7.0        BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6         xfun_0.51            bslib_0.9.0         
#>  [4] ggplot2_3.5.1        recipes_1.1.1        lattice_0.22-6      
#>  [7] vctrs_0.6.5          tools_4.4.2          generics_0.1.3      
#> [10] stats4_4.4.2         parallel_4.4.2       tibble_3.2.1        
#> [13] pkgconfig_2.0.3      ModelMetrics_1.2.2.2 Matrix_1.7-2        
#> [16] data.table_1.17.0    lifecycle_1.0.4      compiler_4.4.2      
#> [19] stringr_1.5.1        munsell_0.5.1        codetools_0.2-20    
#> [22] htmltools_0.5.8.1    sys_3.4.3            buildtools_1.0.0    
#> [25] class_7.3-23         sass_0.4.9           yaml_2.3.10         
#> [28] prodlim_2024.06.25   crayon_1.5.3         pillar_1.10.1       
#> [31] jquerylib_0.1.4      MASS_7.3-65          cachem_1.1.0        
#> [34] gower_1.0.2          iterators_1.0.14     rpart_4.1.24        
#> [37] foreach_1.5.2        nlme_3.1-167         parallelly_1.42.0   
#> [40] lava_1.8.1           tidyselect_1.2.1     digest_0.6.37       
#> [43] stringi_1.8.4        future_1.34.0        dplyr_1.1.4         
#> [46] reshape2_1.4.4       purrr_1.0.4          listenv_0.9.1       
#> [49] maketools_1.3.2      splines_4.4.2        fastmap_1.2.0       
#> [52] grid_4.4.2           colorspace_2.1-1     cli_3.6.4           
#> [55] magrittr_2.0.3       survival_3.8-3       future.apply_1.11.3 
#> [58] withr_3.0.2          scales_1.3.0         lubridate_1.9.4     
#> [61] timechange_0.3.0     rmarkdown_2.29       globals_0.16.3      
#> [64] nnet_7.3-20          timeDate_4041.110    GA_3.2.4            
#> [67] memoise_2.0.1        evaluate_1.0.3       knitr_1.49          
#> [70] hardhat_1.4.1        caret_7.0-1          rlang_1.1.5         
#> [73] Rcpp_1.0.14          glue_1.8.0           BiocManager_1.30.25 
#> [76] pROC_1.18.5          ipred_0.9-15         jsonlite_1.9.1      
#> [79] R6_2.6.1             plyr_1.8.9

References

Gogarten, Stephanie M, Tushar Bhangale, Matthew P Conomos, Cecelia A Laurie, Caitlin P McHugh, Ian Painter, Xiuwen Zheng, et al. 2012. “GWASTools: An r/Bioconductor Package for Quality Control and Analysis of Genome-Wide Association Studies.” Bioinformatics 28 (24): 3329–31.

Kang, Hyun Min, Noah A. Zaitlen, Claire M. Wade, Andrew Kirby, David Heckerman, Mark J. Daly, and Eleazar Eskin. 2008. “Efficient Control of Population Structure in Model Organism Association Mapping.” Genetics 178 (3): 1709–23. https://doi.org/10.1534/genetics.107.080101.

Williams, Jacob, Marco A. R. Ferreira, and Tieming Ji. 2022. “BICOSS: Bayesian Iterative Conditional Stochastic Search for GWAS.” BMC Bioinformatics 23 (475): 475. https://doi.org/10.1186/s12859-022-05030-0.

Xu, Shuangshuang, Jacob Williams, and Marco A. R. Ferreira. 202X. “BG2: Bayesian Variable Selection in Generalized Linear Mixed Models with Non-Local Priors for Non-Gaussian GWAS Data.” Bioinformatics Submitted (202X).

Zhang, Zhiwu, Elhan Ersoz, Chao-Qiang Lai, Rory J Todhunter, Hemant K Tiwari, Michael A Gore, Peter J Bradbury, et al. 2010. “Mixed Linear Model Approach Adapted for Genome-Wide Association Studies.” Nature Genetics 42 (4): 355–60.