GrafGen: Classifying Subpopulations of H. pylori Genomes

Introduction

The GrafGen package classifies Helicobacter pylori genomes according to their genetic distance from nine reference populations, as defined by equation 2 in Jin (2019). The main function in this package is grafGen(), which requires a file of genotypes in either PLINK bed or VCF format.

Installing the GrafGen package from Bioconductor

    if (!requireNamespace("BiocManager", quietly = TRUE)) 
        install.packages("BiocManager") 
    BiocManager::install("GrafGen") 

Loading the package

Before using the GrafGen package, it must be loaded into an R session.

    library(GrafGen)

Example data

The GrafGen package includes example data, a subset of the reference data that was used to train the model. The example file is stored in the package's extdata folder.

    dir <- system.file("extdata", package="GrafGen", mustWork=TRUE)
    geno.file <- file.path(dir, "data.vcf.gz")
    print(geno.file)
## [1] "/tmp/Rtmpym8yuj/Rinst1d0f51f35b7e/GrafGen/extdata/data.vcf.gz"
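
Before running the classifier, it can be useful to check which samples a gzipped VCF contains. The sketch below is illustrative only: it writes a tiny, made-up VCF to a temporary file (the sample names S1 and S2, the contig name and the single variant are all invented) and reads the sample IDs back with base R. Pointing gzfile() at geno.file instead should behave the same way, assuming the column header line falls within the lines that are read.

```r
# Write a tiny, made-up VCF to a temporary gzipped file (illustration only).
vcf.lines <- c(
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0\t1"
)
tmp.vcf <- tempfile(fileext = ".vcf.gz")
con <- gzfile(tmp.vcf, "w")
writeLines(vcf.lines, con)
close(con)

# Read the first lines back and locate the column header line.
con <- gzfile(tmp.vcf, "r")
hdr <- readLines(con, n = 100)
close(con)
col.line <- hdr[startsWith(hdr, "#CHROM")]

# In a VCF, sample IDs follow the nine fixed columns (CHROM ... FORMAT).
samples <- strsplit(col.line, "\t", fixed = TRUE)[[1]][-(1:9)]
print(samples)
```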

Running grafGen()

The grafGen() function returns a list of class “grafpop” with two objects: table and vertex. The object table is a data frame containing the hypothetical ancestry percentages (F_percent, E_percent and A_percent) based on known African, European and Asian samples, respectively; the normalized genetic distance scores (GD1_x, GD2_y, GD3_z); the predicted reference population (Refpop); the nearest neighboring reference population (Nearest_neighbor); the percent separation (Separation_percent), as defined in the user manual; and the genetic distances to each reference population (hpgpAfrica, hpgpAfrica-distant, hpgpAfroamerica, hpgpEuroamerica, hpgpMediterranea, hpgpEurope, hpgpEurasia, hpgpAsia, and hpgpAklavik86-like).
The object vertex is a list containing the (fixed) x-y coordinates of the African, European and Asian vertex population centroids.

    ret <- grafGen(geno.file, print=0)
    ret$table[seq_len(5), ]
##         Sample N_SNPs    GD1_x    GD2_y     GD3_z F_percent E_percent A_percent
## 1 HpGP-ALG-002  35528 1.325330 1.246303 -0.008719     27.79     72.21         0
## 2 HpGP-ALG-004  35528 1.355911 1.264769  0.004511     19.76     80.24         0
## 3 HpGP-ALG-005  35528 1.350071 1.267337 -0.003531     19.70     80.30         0
## 4 HpGP-ALG-006  35528 1.340957 1.265292 -0.002128     21.14     78.86         0
## 5 HpGP-ALG-010  35528 1.343997 1.266336  0.003096     20.57     79.43         0
##   hpgpAfrica hpgpAfrica-distant hpgpAfroamerica hpgpEuroamerica
## 1   0.398096           0.661930        0.324232        0.276226
## 2   0.429534           0.660384        0.336875        0.279032
## 3   0.420432           0.655047        0.331030        0.275570
## 4   0.416124           0.658398        0.327399        0.275852
## 5   0.422221           0.657674        0.333455        0.279073
##   hpgpMediterranea hpgpEurope hpgpEurasia hpgpAsia hpgpAklavik86-like
## 1         0.277100   0.305868    0.395434 0.577032           0.597266
## 2         0.260454   0.289007    0.380532 0.564095           0.587367
## 3         0.256573   0.284675    0.379566 0.565708           0.589778
## 4         0.257503   0.288340    0.383189 0.573169           0.592772
## 5         0.261808   0.288747    0.384723 0.573536           0.594683
##             Refpop Nearest_neighbor Separation_percent
## 1  hpgpEuroamerica hpgpMediterranea               0.32
## 2 hpgpMediterranea  hpgpEuroamerica               7.13
## 3 hpgpMediterranea  hpgpEuroamerica               7.40
## 4 hpgpMediterranea  hpgpEuroamerica               7.13
## 5 hpgpMediterranea  hpgpEuroamerica               6.59
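
In the rows above, Refpop coincides with the reference population at the smallest genetic distance, and Nearest_neighbor with the one at the second smallest. The sketch below reproduces both columns from the nine distance columns, using the values printed for the first five samples. The separation formula used here, 100 * (d2 - d1) / d1, is an assumption that happens to agree with these five rows after rounding; the user manual remains the authoritative definition.

```r
# Distances copied from the first five rows of the output above.
pops <- c("hpgpAfrica", "hpgpAfrica-distant", "hpgpAfroamerica",
          "hpgpEuroamerica", "hpgpMediterranea", "hpgpEurope",
          "hpgpEurasia", "hpgpAsia", "hpgpAklavik86-like")
dist <- matrix(c(
    0.398096, 0.661930, 0.324232, 0.276226, 0.277100, 0.305868, 0.395434, 0.577032, 0.597266,
    0.429534, 0.660384, 0.336875, 0.279032, 0.260454, 0.289007, 0.380532, 0.564095, 0.587367,
    0.420432, 0.655047, 0.331030, 0.275570, 0.256573, 0.284675, 0.379566, 0.565708, 0.589778,
    0.416124, 0.658398, 0.327399, 0.275852, 0.257503, 0.288340, 0.383189, 0.573169, 0.592772,
    0.422221, 0.657674, 0.333455, 0.279073, 0.261808, 0.288747, 0.384723, 0.573536, 0.594683
), ncol = length(pops), byrow = TRUE, dimnames = list(NULL, pops))

# For each sample (row), rank the populations from nearest to farthest.
ord <- t(apply(dist, 1, order))
rows <- seq_len(nrow(dist))

refpop   <- pops[ord[, 1]]            # smallest distance
neighbor <- pops[ord[, 2]]            # second smallest distance
d1 <- dist[cbind(rows, ord[, 1])]
d2 <- dist[cbind(rows, ord[, 2])]

# ASSUMPTION: relative gap between the two nearest populations, in percent.
sep <- round(100 * (d2 - d1) / d1, 2)

data.frame(Refpop = refpop, Nearest_neighbor = neighbor,
           Separation_percent = sep)
```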

Printing the return object from grafGen() will display a table of frequency counts for the predicted reference populations for the user input data.

    print(ret)
## 
## Predicted reference population counts:
##         hpgpAfrica hpgpAfrica-distant    hpgpAfroamerica    hpgpEuroamerica 
##                 15                  1                 10                 28 
##   hpgpMediterranea         hpgpEurope        hpgpEurasia           hpgpAsia 
##                 44                 57                 13                 36 
## hpgpAklavik86-like 
##                  2
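
Because the table element is an ordinary data frame, similar counts can be tabulated by hand, and calls with a small Separation_percent can be singled out for closer inspection. The sketch below uses the five rows printed earlier rather than the full result; the 1% cutoff is a hypothetical value chosen for illustration, not a recommendation from the package.

```r
# Columns copied from the first five rows of ret$table shown earlier.
calls <- data.frame(
    Sample = c("HpGP-ALG-002", "HpGP-ALG-004", "HpGP-ALG-005",
               "HpGP-ALG-006", "HpGP-ALG-010"),
    Refpop = c("hpgpEuroamerica", rep("hpgpMediterranea", 4)),
    Nearest_neighbor = c("hpgpMediterranea", rep("hpgpEuroamerica", 4)),
    Separation_percent = c(0.32, 7.13, 7.40, 7.13, 6.59)
)

# Frequency of each predicted reference population.
counts <- table(calls$Refpop)
print(counts)

# ASSUMPTION: an arbitrary cutoff for "nearly tied" calls.
min.sep <- 1
ambiguous <- calls[calls$Separation_percent < min.sep, ]
print(ambiguous)
```

With the full result, calls would simply be replaced by ret$table.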

Plotting the return object will display a plot of the genetic distance scores (GD1_x vs GD2_y) for the user input data and the reference data. Additional plots can be obtained by calling the grafGenPlot() function.

    plot(ret)
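
The score columns can also be passed to any other plotting routine when a custom figure is wanted. The sketch below draws GD1_x against GD2_y with base graphics for the five samples printed earlier; the colours are arbitrary choices, and the result is not meant to reproduce the package's own plot, which also includes the reference data.

```r
# Scores and predicted populations copied from the rows shown earlier.
scores <- data.frame(
    Sample = c("HpGP-ALG-002", "HpGP-ALG-004", "HpGP-ALG-005",
               "HpGP-ALG-006", "HpGP-ALG-010"),
    GD1_x  = c(1.325330, 1.355911, 1.350071, 1.340957, 1.343997),
    GD2_y  = c(1.246303, 1.264769, 1.267337, 1.265292, 1.266336),
    Refpop = c("hpgpEuroamerica", rep("hpgpMediterranea", 4))
)

# One (arbitrary) colour per predicted reference population.
cols <- c(hpgpEuroamerica = "steelblue", hpgpMediterranea = "darkorange")
point.cols <- unname(cols[scores$Refpop])

plot(scores$GD1_x, scores$GD2_y, col = point.cols, pch = 19,
     xlab = "GD1_x", ylab = "GD2_y",
     main = "Genetic distance scores (five example samples)")
legend("topleft", legend = names(cols), col = cols, pch = 19, bty = "n")
```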

Interactive plots

The functions interactiveReferencePlot and interactivePlot create interactive plots for the reference data and the user input data, respectively. A call to interactiveReferencePlot will show the results for all samples in the reference data. Hovering over a point in the plot displays three lines of information. Line 1 contains the type and ID of that sample. Line 2 contains the sample’s reference population, its next nearest reference population, and the percent separation from that next nearest population, as defined in the user manual. Line 3 contains the percent African, European and Asian ancestry for that sample. The legend shows the types (which are the source countries in interactiveReferencePlot) for all samples, and clicking the name of a type will add or remove those samples from the plot.

    if (interactive()) interactiveReferencePlot()

R shiny app

The GrafGen package also includes an R shiny app to view and filter the plot using up to two variables. The function createApp returns a list containing the app and the data objects needed by the app. The app can then be launched with the runApp function.

    tmp <- createApp(ret)
    if (interactive()) {
        reference_results <- tmp$reference_results
        user_results      <- tmp$user_results
        user_metadata     <- tmp$user_metadata
        shiny::runApp(tmp$app)
    }

Session Information

    sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] GrafGen_1.3.0  rmarkdown_2.28
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            xfun_0.48               bslib_0.8.0            
##  [4] ggplot2_3.5.1           htmlwidgets_1.6.4       rstatix_0.7.2          
##  [7] vctrs_0.6.5             tools_4.4.1             generics_0.1.3         
## [10] stats4_4.4.1            tibble_3.2.1            fansi_1.0.6            
## [13] highr_0.11              pkgconfig_2.0.3         data.table_1.16.2      
## [16] RColorBrewer_1.1-3      S4Vectors_0.43.2        GenomeInfoDbData_1.2.13
## [19] lifecycle_1.0.4         farver_2.1.2            compiler_4.4.1         
## [22] stringr_1.5.1           munsell_0.5.1           carData_3.0-5          
## [25] GenomeInfoDb_1.41.2     httpuv_1.6.15           htmltools_0.5.8.1      
## [28] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
## [31] yaml_2.3.10             lazyeval_0.2.2          Formula_1.2-5          
## [34] plotly_4.10.4           pillar_1.9.0            later_1.3.2            
## [37] car_3.1-3               ggpubr_0.6.0            jquerylib_0.1.4        
## [40] tidyr_1.3.1             MASS_7.3-61             cachem_1.1.0           
## [43] abind_1.4-8             mime_0.12               tidyselect_1.2.1       
## [46] digest_0.6.37           stringi_1.8.4           dplyr_1.1.4            
## [49] purrr_1.0.2             maketools_1.3.1         cowplot_1.1.3          
## [52] fastmap_1.2.0           grid_4.4.1              colorspace_2.1-1       
## [55] cli_3.6.3               magrittr_2.0.3          utf8_1.2.4             
## [58] broom_1.0.7             withr_3.0.2             UCSC.utils_1.1.0       
## [61] scales_1.3.0            promises_1.3.0          backports_1.5.0        
## [64] XVector_0.45.0          httr_1.4.7              ggsignif_0.6.4         
## [67] shiny_1.9.1             evaluate_1.0.1          knitr_1.48             
## [70] IRanges_2.39.2          GenomicRanges_1.57.2    viridisLite_0.4.2      
## [73] rlang_1.1.4             Rcpp_1.0.13             xtable_1.8-4           
## [76] glue_1.8.0              BiocGenerics_0.53.0     jsonlite_1.8.9         
## [79] R6_2.5.1                zlibbioc_1.51.2