Determine population ancestry from DNAm arrays

This notebook describes how to obtain the GLINT software suite for DNAm analysis, and how to run GLINT with the EPISTRUCTURE method for inferring population genetic ancestry/descent from HM450K DNAm array probes. These command line tools are called using a conda virtual environment managed from an R session with the basilisk Bioconductor package. Code in this notebook should work for Mac and Linux-like environments. Consult Rahmani et al. (2017) for more details about the EPISTRUCTURE method.

Obtain the GLINT software

First, obtain the latest version of GLINT online by downloading the source from GitHub.

Also ensure the basilisk Bioconductor/R package is installed. We’ll use this to conveniently manage conda virtual environments to run GLINT from an R session.

BiocManager::install("basilisk")
library(basilisk)

Virtual environment setup

Next, set up a virtual environment using conda. This step is crucial for reproducible research, as it enables control over which software versions to use in a session. This enables running software that is older or no longer maintained, a fairly common task in computational sciences.

Using basilisk, set up a Python 2 environment with seven dependencies (numpy, pandas, scipy, scikit-learn, matplotlib, statsmodels, and cvxopt) for which we specify the version using the form “packagename==version” (e.g. “numpy==1.15”).

env.name <- "glint_env"          # name the new virtual environment
pkg.name <- "recountmethylation" # set env properties
pkgv <- c("python==2.7",         # python version (v2.7)
          "numpy==1.15",         # numpy version (v1.15)
          "pandas==0.22",        # pandas version (v0.22)
          "scipy==1.2",          # scipy version (v1.2)
          "scikit-learn==0.19",  # scikit-learn (v0.19)
          "matplotlib==2.2",     # matplotlib (v2.2)
          "statsmodels==0.9",    # statsmodels (v0.9)
          "cvxopt==1.2")         # cvxopt (v1.2)
glint_env <- BasiliskEnvironment(envname = env.name, pkgname = pkg.name, 
                                 packages = pkgv)
proc <- basiliskStart(glint_env) # define run process
on.exit(basiliskStop(proc))      # define exit process

This makes a BasiliskEnvironment object, glint_env, with a starting process called proc and a session end process specified with on.exit().

Process example DNAm array data

This section shows how to run GLINT on an example dataset, with the EPISTRUCTURE method enabled. It includes info about the required data formats and shows how to adjust for covariates.

To run GLINT on DNAm array stored as a SummarizedExperiment object, first access the test HM450K MethylSet from the minfiData package.

library(minfiData)
ms <- get(data("MsetEx")) # load example MethylSet

Now load the explanatory CpG probe vector for EPISTRUCTURE. This small subset of 4,913 HM450K array probes was found by Rahmani et al. (2017) to be strongly correlated with SNPs informing population ancestry and genetic structure. Access them from recountmethylation as follows.

dpath <- system.file("extdata", "glint_files", 
                     package = "recountmethylation") # get the dir path
cgv.fpath <- file.path(dpath, "glint_epistructure_explanatory-cpgs.rda")
glint.cgv <- get(load(cgv.fpath)) # load explanatory probes

Subset ms, a MethylSet, to include just the 4,913 explanatory probes. This will save considerable disk space and memory for processing very large datasets.

mf <- ms[rownames(ms) %in% glint.cgv,] # filter ms on explanatory probes
dim(mf)                                # mf dimensions: [1] 4913    6

Next, identify desired model covariates from sample metadata, then convert to numeric/float format (required by GLINT). Specify the variables “age” and “sex” corresponding to columns in the file covariates_minfidata.txt.

# get covar -- all vars should be numeric/float
covar <- as.data.frame(colData(mf)[,c("age", "sex")]) # get sample metadata
covar[,"sex"] <- ifelse(covar[,"sex"] == "M", 1, 0)   # relabel sex for glint
# write covariates matrix
covar.fpath <- file.path("covariates_minfidata.txt")  # specify covariate table path
# write table colnames, values
write.table(covar, file = covar.fpath, sep = "\t", row.names = T, 
            col.names = T, append = F, quote = F)     # write covariates table

Now calculate the DNAm fractoins or “Beta-values”. Impute any missing values with row medians, and write the final file to bval_minfidata.txt.

bval.fpath <- file.path("bval_minfidata.txt")     # specify dnam fractions table name
mbval <- t(apply(as.matrix(getBeta(mf)),1,function(ri){
  ri[is.na(ri)] <- median(ri,na.rm=T)             # impute na's with row medians
  return(round(ri, 4))
})); rownames(mbval) <- rownames(mf)              # assign probe ids to row names
write.table(mbval, file = bval.fpath, sep = sepsym, 
            row.names = T, col.names = T, append = F, 
            quote = F)                            # write dnam fractions table

Next, set the system commands to make command line calls from R. Define these manually as strings to be passed to the system() function, specifying the paths to the new minfiData example files.

glint.dpath <- "glint-1.0.4"                         # path to the main glint app dir
glint.pypath <- file.path(glint.dpath, "glint.py")   # path to the glint .py script
data.fpath <- file.path("bval_minfidata.txt")        # path to the DNAm data table
covar.fpath <- file.path("covariates_minfidata.txt") # path to the metadata table
out.fstr <- file.path("glint_results_minfidata")     # base string for ouput results files
covarv <- c("age", "sex")                            # vector of valid covariates
covar.str <- paste0(covarv, collapse = " ")          # format the covariates vector
cmd.str <- paste0(c("python", glint.pypath, 
                    "--datafile", data.fpath, 
                    "--covarfile", covar.fpath, 
                    "--covar", covar.str, 
                    "--epi", "--out", out.fstr), 
                  collapse = " ")                    # get the final command line call

The commands stored as cmd.str include the path to the latest GLINT version, glint.path, the paths to the datafile.txt and covariates.txt tutorial files, the variable names age and gender which are our covariates of interest and correspond to column names in covariates.txt. We also used the --epi flag to ensure the EPISTRUCTURE method is run.

Now run GLINT with basiliskRun(). This should relay system outputs back to our console, which are included as comments in the below code chunk.

basiliskRun(proc, function(cmd.str){system(cmd.str)}, cmd.str = cmd.str) # run glint
# this returns:
# INFO      >>> python glint-1.0.4/glint.py --datafile bval_minfidata.txt --covarfile covariates_minfidata.txt --covar age sex --epi --out glint_results_minfidata
# INFO      Starting GLINT...
# INFO      Validating arguments...
# INFO      Loading file bval_minfidata.txt...
# INFO      Checking for missing values in the data file...
# INFO      Validating covariates file...
# INFO      Loading file covariates_minfidata.txt...
# INFO      New covariates were found: age, sex.
# INFO      Running EPISTRUCTURE...
# INFO      Removing non-informative sites...
# INFO      Including sites...
# INFO      Include sites: 4913 CpGs from the reference list of 4913 CpGs will be included...
# WARNING   Found no sites to exclude.
# INFO      Using covariates age, sex.
# INFO      Regressing out covariates...
# INFO      Running PCA...
# INFO      The first 1 PCs were saved to glint_results_minfidata.epistructure.pcs.txt.
# INFO      Added covariates epi1.
# Validating all dependencies are installed...
# All dependencies are installed
# [1] 0

Since we declared --out glint_results_minfidata, results files are saved with the beginning string “glint_results_minfidata” appended. Logs were saved to the file with the *.glint.log extension, while data were saved to the file with the *.epistructure.pcs.txt extension.

Now inspect the output results data file glint_results_minfidata.epistructure.pcs.txt.

out.fpath <- paste0(out.fpath, ".epistructure.pcs.txt")
res2 <- read.table(out.fpath, sep = "\t")
colnames(res2) <- c("sample", "epistructure.pc")
dim(res2)
## [1] 6 2
res2
##              sample epistructure.pc
## 1 5723646052_R02C02       -53.98643
## 2 5723646052_R04C01       -61.72035
## 3 5723646052_R05C02        53.98641
## 4 5723646053_R04C02        27.26870
## 5 5723646053_R05C02       -27.26869
## 6 5723646053_R06C02        61.72035

The first results column reflects sample IDs from the columns in bval_minfidata.txt. Remaining columns show the EPISTRUCTURE population components. While just one population component calculated in this example, experiment datasets may generate outputs with more than one population component and thus several component columns.

Further reading

For more details about GLINT, see the software documentation and GitHub repo. Additional tutorial files are also available.

Consult Rahmani et al. (2017) for more details about the EPISTRUCTURE method, including the discovery of explanatory CpG probes associated with population structure SNPs.

For more details about setting up virtual environments from R, consult the basilisk package documentation.

Session Info

utils::sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] limma_3.63.0                                       
##  [2] gridExtra_2.3                                      
##  [3] knitr_1.48                                         
##  [4] recountmethylation_1.17.0                          
##  [5] HDF5Array_1.35.0                                   
##  [6] rhdf5_2.50.0                                       
##  [7] DelayedArray_0.33.1                                
##  [8] SparseArray_1.6.0                                  
##  [9] S4Arrays_1.6.0                                     
## [10] abind_1.4-8                                        
## [11] Matrix_1.7-1                                       
## [12] ggplot2_3.5.1                                      
## [13] minfiDataEPIC_1.31.0                               
## [14] IlluminaHumanMethylationEPICanno.ilm10b2.hg19_0.6.0
## [15] IlluminaHumanMethylationEPICmanifest_0.3.0         
## [16] minfiData_0.51.0                                   
## [17] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 
## [18] IlluminaHumanMethylation450kmanifest_0.4.0         
## [19] minfi_1.53.0                                       
## [20] bumphunter_1.49.0                                  
## [21] locfit_1.5-9.10                                    
## [22] iterators_1.0.14                                   
## [23] foreach_1.5.2                                      
## [24] Biostrings_2.75.0                                  
## [25] XVector_0.46.0                                     
## [26] SummarizedExperiment_1.36.0                        
## [27] Biobase_2.67.0                                     
## [28] MatrixGenerics_1.19.0                              
## [29] matrixStats_1.4.1                                  
## [30] GenomicRanges_1.59.0                               
## [31] GenomeInfoDb_1.43.0                                
## [32] IRanges_2.41.0                                     
## [33] S4Vectors_0.44.0                                   
## [34] BiocGenerics_0.53.0                                
## [35] BiocStyle_2.35.0                                   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        sys_3.4.3                
##   [3] jsonlite_1.8.9            magrittr_2.0.3           
##   [5] GenomicFeatures_1.59.0    farver_2.1.2             
##   [7] rmarkdown_2.28            BiocIO_1.17.0            
##   [9] zlibbioc_1.52.0           vctrs_0.6.5              
##  [11] multtest_2.63.0           memoise_2.0.1            
##  [13] Rsamtools_2.22.0          DelayedMatrixStats_1.29.0
##  [15] RCurl_1.98-1.16           askpass_1.2.1            
##  [17] htmltools_0.5.8.1         curl_5.2.3               
##  [19] Rhdf5lib_1.28.0           sass_0.4.9               
##  [21] nor1mix_1.3-3             bslib_0.8.0              
##  [23] plyr_1.8.9                cachem_1.1.0             
##  [25] buildtools_1.0.0          GenomicAlignments_1.43.0 
##  [27] lifecycle_1.0.4           pkgconfig_2.0.3          
##  [29] R6_2.5.1                  fastmap_1.2.0            
##  [31] GenomeInfoDbData_1.2.13   digest_0.6.37            
##  [33] colorspace_2.1-1          siggenes_1.80.0          
##  [35] reshape_0.8.9             AnnotationDbi_1.69.0     
##  [37] RSQLite_2.3.7             base64_2.0.2             
##  [39] labeling_0.4.3            fansi_1.0.6              
##  [41] httr_1.4.7                compiler_4.4.1           
##  [43] beanplot_1.3.1            rngtools_1.5.2           
##  [45] withr_3.0.2               bit64_4.5.2              
##  [47] BiocParallel_1.41.0       DBI_1.2.3                
##  [49] highr_0.11                MASS_7.3-61              
##  [51] openssl_2.2.2             rjson_0.2.23             
##  [53] tools_4.4.1               rentrez_1.2.3            
##  [55] glue_1.8.0                quadprog_1.5-8           
##  [57] restfulr_0.0.15           nlme_3.1-166             
##  [59] rhdf5filters_1.18.0       grid_4.4.1               
##  [61] generics_0.1.3            gtable_0.3.6             
##  [63] tzdb_0.4.0                preprocessCore_1.68.0    
##  [65] tidyr_1.3.1               data.table_1.16.2        
##  [67] hms_1.1.3                 xml2_1.3.6               
##  [69] utf8_1.2.4                pillar_1.9.0             
##  [71] genefilter_1.89.0         splines_4.4.1            
##  [73] dplyr_1.1.4               lattice_0.22-6           
##  [75] survival_3.7-0            rtracklayer_1.66.0       
##  [77] bit_4.5.0                 GEOquery_2.75.0          
##  [79] annotate_1.85.0           tidyselect_1.2.1         
##  [81] maketools_1.3.1           xfun_0.48                
##  [83] scrime_1.3.5              statmod_1.5.0            
##  [85] UCSC.utils_1.2.0          yaml_2.3.10              
##  [87] evaluate_1.0.1            codetools_0.2-20         
##  [89] tibble_3.2.1              BiocManager_1.30.25      
##  [91] cli_3.6.3                 xtable_1.8-4             
##  [93] munsell_0.5.1             jquerylib_0.1.4          
##  [95] Rcpp_1.0.13               png_0.1-8                
##  [97] XML_3.99-0.17             readr_2.1.5              
##  [99] blob_1.2.4                mclust_6.1.1             
## [101] doRNG_1.8.6               sparseMatrixStats_1.18.0 
## [103] bitops_1.0-9              scales_1.3.0             
## [105] illuminaio_0.49.0         purrr_1.0.2              
## [107] crayon_1.5.3              rlang_1.1.4              
## [109] KEGGREST_1.47.0

Works Cited

Rahmani, Elior, Liat Shenhav, Regev Schweiger, Paul Yousefi, Karen Huen, Brenda Eskenazi, Celeste Eng, et al. 2017. “Genome-Wide Methylation Data Mirror Ancestry Information.” Epigenetics & Chromatin 10 (1): 1. https://doi.org/10.1186/s13072-016-0108-y.