This notebook describes how to
obtain the GLINT
software suite for DNAm analysis, and how
to run GLINT
with the EPISTRUCTURE
method for
inferring population genetic ancestry/descent from HM450K DNAm array
probes. These command line tools are called using a conda virtual
environment managed from an R session with the basilisk
Bioconductor package. Code in this notebook should work for Mac and
Linux-like environments. Consult Rahmani et al.
(2017) for more details about the EPISTRUCTURE
method.
GLINT
softwareFirst, obtain the latest version of GLINT
online by
downloading the source from
GitHub.
Also ensure the basilisk
Bioconductor/R package is
installed. We’ll use this to conveniently manage conda virtual
environments to run GLINT
from an R session.
Next, set up a virtual environment using conda. This step is crucial for reproducible research, as it enables control over which software versions to use in a session. This enables running software that is older or no longer maintained, a fairly common task in computational sciences.
Using basilisk
, set up a Python 2 environment with seven
dependencies (numpy
, pandas
,
scipy
, scikit-learn
, matplotlib
,
statsmodels
, and cvxopt
) for which we specify
the version using the form “packagename==version”
(e.g. “numpy==1.15”).
env.name <- "glint_env" # name the new virtual environment
pkg.name <- "recountmethylation" # set env properties
pkgv <- c("python==2.7", # python version (v2.7)
"numpy==1.15", # numpy version (v1.15)
"pandas==0.22", # pandas version (v0.22)
"scipy==1.2", # scipy version (v1.2)
"scikit-learn==0.19", # scikit-learn (v0.19)
"matplotlib==2.2", # matplotlib (v2.2)
"statsmodels==0.9", # statsmodels (v0.9)
"cvxopt==1.2") # cvxopt (v1.2)
glint_env <- BasiliskEnvironment(envname = env.name, pkgname = pkg.name,
packages = pkgv)
proc <- basiliskStart(glint_env) # define run process
on.exit(basiliskStop(proc)) # define exit process
This makes a BasiliskEnvironment
object,
glint_env
, with a starting process called proc
and a session end process specified with on.exit()
.
This section shows how to run GLINT
on an example
dataset, with the EPISTRUCTURE
method enabled. It includes
info about the required data formats and shows how to adjust for
covariates.
To run GLINT
on DNAm array stored as a
SummarizedExperiment
object, first access the test HM450K
MethylSet
from the minfiData
package.
Now load the explanatory CpG probe vector for
EPISTRUCTURE
. This small subset of 4,913 HM450K array
probes was found by Rahmani et al. (2017)
to be strongly correlated with SNPs informing population ancestry and
genetic structure. Access them from recountmethylation
as
follows.
dpath <- system.file("extdata", "glint_files",
package = "recountmethylation") # get the dir path
cgv.fpath <- file.path(dpath, "glint_epistructure_explanatory-cpgs.rda")
glint.cgv <- get(load(cgv.fpath)) # load explanatory probes
Subset ms
, a MethylSet
, to include just the
4,913 explanatory probes. This will save considerable disk space and
memory for processing very large datasets.
mf <- ms[rownames(ms) %in% glint.cgv,] # filter ms on explanatory probes
dim(mf) # mf dimensions: [1] 4913 6
Next, identify desired model covariates from sample metadata, then
convert to numeric/float format (required by GLINT
).
Specify the variables “age” and “sex” corresponding to columns in the
file covariates_minfidata.txt
.
# get covar -- all vars should be numeric/float
covar <- as.data.frame(colData(mf)[,c("age", "sex")]) # get sample metadata
covar[,"sex"] <- ifelse(covar[,"sex"] == "M", 1, 0) # relabel sex for glint
# write covariates matrix
covar.fpath <- file.path("covariates_minfidata.txt") # specify covariate table path
# write table colnames, values
write.table(covar, file = covar.fpath, sep = "\t", row.names = T,
col.names = T, append = F, quote = F) # write covariates table
Now calculate the DNAm fractoins or “Beta-values”. Impute any missing
values with row medians, and write the final file to
bval_minfidata.txt
.
bval.fpath <- file.path("bval_minfidata.txt") # specify dnam fractions table name
mbval <- t(apply(as.matrix(getBeta(mf)),1,function(ri){
ri[is.na(ri)] <- median(ri,na.rm=T) # impute na's with row medians
return(round(ri, 4))
})); rownames(mbval) <- rownames(mf) # assign probe ids to row names
write.table(mbval, file = bval.fpath, sep = sepsym,
row.names = T, col.names = T, append = F,
quote = F) # write dnam fractions table
Next, set the system commands to make command line calls from R.
Define these manually as strings to be passed to the
system()
function, specifying the paths to the new
minfiData
example files.
glint.dpath <- "glint-1.0.4" # path to the main glint app dir
glint.pypath <- file.path(glint.dpath, "glint.py") # path to the glint .py script
data.fpath <- file.path("bval_minfidata.txt") # path to the DNAm data table
covar.fpath <- file.path("covariates_minfidata.txt") # path to the metadata table
out.fstr <- file.path("glint_results_minfidata") # base string for ouput results files
covarv <- c("age", "sex") # vector of valid covariates
covar.str <- paste0(covarv, collapse = " ") # format the covariates vector
cmd.str <- paste0(c("python", glint.pypath,
"--datafile", data.fpath,
"--covarfile", covar.fpath,
"--covar", covar.str,
"--epi", "--out", out.fstr),
collapse = " ") # get the final command line call
The commands stored as cmd.str
include the path to the
latest GLINT
version, glint.path
, the paths to
the datafile.txt
and covariates.txt
tutorial
files, the variable names age
and gender
which
are our covariates of interest and correspond to column names in
covariates.txt
. We also used the --epi
flag to
ensure the EPISTRUCTURE
method is run.
Now run GLINT
with basiliskRun()
. This
should relay system outputs back to our console, which are included as
comments in the below code chunk.
basiliskRun(proc, function(cmd.str){system(cmd.str)}, cmd.str = cmd.str) # run glint
# this returns:
# INFO >>> python glint-1.0.4/glint.py --datafile bval_minfidata.txt --covarfile covariates_minfidata.txt --covar age sex --epi --out glint_results_minfidata
# INFO Starting GLINT...
# INFO Validating arguments...
# INFO Loading file bval_minfidata.txt...
# INFO Checking for missing values in the data file...
# INFO Validating covariates file...
# INFO Loading file covariates_minfidata.txt...
# INFO New covariates were found: age, sex.
# INFO Running EPISTRUCTURE...
# INFO Removing non-informative sites...
# INFO Including sites...
# INFO Include sites: 4913 CpGs from the reference list of 4913 CpGs will be included...
# WARNING Found no sites to exclude.
# INFO Using covariates age, sex.
# INFO Regressing out covariates...
# INFO Running PCA...
# INFO The first 1 PCs were saved to glint_results_minfidata.epistructure.pcs.txt.
# INFO Added covariates epi1.
# Validating all dependencies are installed...
# All dependencies are installed
# [1] 0
Since we declared --out glint_results_minfidata
, results
files are saved with the beginning string “glint_results_minfidata”
appended. Logs were saved to the file with the *.glint.log
extension, while data were saved to the file with the
*.epistructure.pcs.txt
extension.
Now inspect the output results data file
glint_results_minfidata.epistructure.pcs.txt
.
## [1] 6 2
## sample epistructure.pc
## 1 5723646052_R02C02 -53.98643
## 2 5723646052_R04C01 -61.72035
## 3 5723646052_R05C02 53.98641
## 4 5723646053_R04C02 27.26870
## 5 5723646053_R05C02 -27.26869
## 6 5723646053_R06C02 61.72035
The first results column reflects sample IDs from the columns in
bval_minfidata.txt
. Remaining columns show the
EPISTRUCTURE
population components. While just one
population component calculated in this example, experiment datasets may
generate outputs with more than one population component and thus
several component columns.
For more details about GLINT
, see the software documentation
and GitHub repo.
Additional tutorial
files are also available.
Consult Rahmani et al. (2017) for more
details about the EPISTRUCTURE
method, including the
discovery of explanatory CpG probes associated with population structure
SNPs.
For more details about setting up virtual environments from R,
consult the basilisk
package documentation.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] limma_3.63.0
## [2] gridExtra_2.3
## [3] knitr_1.48
## [4] recountmethylation_1.17.0
## [5] HDF5Array_1.35.0
## [6] rhdf5_2.50.0
## [7] DelayedArray_0.33.1
## [8] SparseArray_1.6.0
## [9] S4Arrays_1.6.0
## [10] abind_1.4-8
## [11] Matrix_1.7-1
## [12] ggplot2_3.5.1
## [13] minfiDataEPIC_1.31.0
## [14] IlluminaHumanMethylationEPICanno.ilm10b2.hg19_0.6.0
## [15] IlluminaHumanMethylationEPICmanifest_0.3.0
## [16] minfiData_0.51.0
## [17] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [18] IlluminaHumanMethylation450kmanifest_0.4.0
## [19] minfi_1.53.0
## [20] bumphunter_1.49.0
## [21] locfit_1.5-9.10
## [22] iterators_1.0.14
## [23] foreach_1.5.2
## [24] Biostrings_2.75.0
## [25] XVector_0.46.0
## [26] SummarizedExperiment_1.36.0
## [27] Biobase_2.67.0
## [28] MatrixGenerics_1.19.0
## [29] matrixStats_1.4.1
## [30] GenomicRanges_1.59.0
## [31] GenomeInfoDb_1.43.0
## [32] IRanges_2.41.0
## [33] S4Vectors_0.44.0
## [34] BiocGenerics_0.53.0
## [35] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] jsonlite_1.8.9 magrittr_2.0.3
## [5] GenomicFeatures_1.59.0 farver_2.1.2
## [7] rmarkdown_2.28 BiocIO_1.17.0
## [9] zlibbioc_1.52.0 vctrs_0.6.5
## [11] multtest_2.63.0 memoise_2.0.1
## [13] Rsamtools_2.22.0 DelayedMatrixStats_1.29.0
## [15] RCurl_1.98-1.16 askpass_1.2.1
## [17] htmltools_0.5.8.1 curl_5.2.3
## [19] Rhdf5lib_1.28.0 sass_0.4.9
## [21] nor1mix_1.3-3 bslib_0.8.0
## [23] plyr_1.8.9 cachem_1.1.0
## [25] buildtools_1.0.0 GenomicAlignments_1.43.0
## [27] lifecycle_1.0.4 pkgconfig_2.0.3
## [29] R6_2.5.1 fastmap_1.2.0
## [31] GenomeInfoDbData_1.2.13 digest_0.6.37
## [33] colorspace_2.1-1 siggenes_1.80.0
## [35] reshape_0.8.9 AnnotationDbi_1.69.0
## [37] RSQLite_2.3.7 base64_2.0.2
## [39] labeling_0.4.3 fansi_1.0.6
## [41] httr_1.4.7 compiler_4.4.1
## [43] beanplot_1.3.1 rngtools_1.5.2
## [45] withr_3.0.2 bit64_4.5.2
## [47] BiocParallel_1.41.0 DBI_1.2.3
## [49] highr_0.11 MASS_7.3-61
## [51] openssl_2.2.2 rjson_0.2.23
## [53] tools_4.4.1 rentrez_1.2.3
## [55] glue_1.8.0 quadprog_1.5-8
## [57] restfulr_0.0.15 nlme_3.1-166
## [59] rhdf5filters_1.18.0 grid_4.4.1
## [61] generics_0.1.3 gtable_0.3.6
## [63] tzdb_0.4.0 preprocessCore_1.68.0
## [65] tidyr_1.3.1 data.table_1.16.2
## [67] hms_1.1.3 xml2_1.3.6
## [69] utf8_1.2.4 pillar_1.9.0
## [71] genefilter_1.89.0 splines_4.4.1
## [73] dplyr_1.1.4 lattice_0.22-6
## [75] survival_3.7-0 rtracklayer_1.66.0
## [77] bit_4.5.0 GEOquery_2.75.0
## [79] annotate_1.85.0 tidyselect_1.2.1
## [81] maketools_1.3.1 xfun_0.48
## [83] scrime_1.3.5 statmod_1.5.0
## [85] UCSC.utils_1.2.0 yaml_2.3.10
## [87] evaluate_1.0.1 codetools_0.2-20
## [89] tibble_3.2.1 BiocManager_1.30.25
## [91] cli_3.6.3 xtable_1.8-4
## [93] munsell_0.5.1 jquerylib_0.1.4
## [95] Rcpp_1.0.13 png_0.1-8
## [97] XML_3.99-0.17 readr_2.1.5
## [99] blob_1.2.4 mclust_6.1.1
## [101] doRNG_1.8.6 sparseMatrixStats_1.18.0
## [103] bitops_1.0-9 scales_1.3.0
## [105] illuminaio_0.49.0 purrr_1.0.2
## [107] crayon_1.5.3 rlang_1.1.4
## [109] KEGGREST_1.47.0