This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and R2 and MSE as performance measure are implemented.
The pengls package is available from BioConductor, and can be installed as follows:
Once installed, it can be loaded and version info printed.
suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.13.0
We first create a toy dataset with spatial coordinates.
library(nlme)
n <- 25 #Sample size
p <- 50 #Number of features
g <- 15 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
df <- data.frame("a" = a, "b" = b, GridSample)
The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.
# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5))
Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter λ = 0.2.
#Fit the pengls model, for simplicity for a simple lambda
penglsFit <- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, lambda = 0.2, verbose = TRUE)
## Starting iterations...
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
## Iteration 9
## Iteration 10
## Iteration 11
## Iteration 12
## Iteration 13
## Iteration 14
## Iteration 15
## Iteration 16
## Iteration 17
## Iteration 18
## Iteration 19
## Iteration 20
## Iteration 21
## Iteration 22
## Iteration 23
## Iteration 24
## Iteration 25
## Iteration 26
## Iteration 27
## Iteration 28
## Iteration 29
## Iteration 30
## Warning in pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", : No convergence achieved in pengls!
Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.
## pengls model with correlation structure: corGaus
## and 14 non-zero coefficients
The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:
set.seed(354509)
n <- 100 #Sample size
p <- 10 #Number of features
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)
The fitting command is similar, this time the λ parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose α = 0.5 this time, fitting an elastic net model.
penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1
## Iteration 2
Show the output
## pengls model with correlation structure: corAR1
## and 2 non-zero coefficients
The pengls package also provides cross-validation for finding the optimal λ value. If the tuning parameter λ is not supplied, the optimal λ according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :
The function is called similarly to cv.glmnet:
penglsFitCV <- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, nfolds = nfolds)
Check the result:
## Cross-validated pengls model with correlation structure: corGaus
## and 50 non-zero coefficients.
## 3 fold cross-validation yielded an estimated R2 of -2.750974 .
By default, the 1 standard error is used to determine the optimal value of λ :
## [1] 0.8071125
## [1] -2.750974
Extract coefficients and fold IDs.
## [1] 696.45930167 -2.49065458 2.91039453 0.15912630 -0.03244352
## [6] -0.11192722
## 129 115 71 75 55 39 33 178 7 134 101 208 44 205 36 62 66 175 2 81
## 3 3 1 1 2 2 2 3 2 1 1 1 1 1 2 2 2 3 2 2
## 110 64 22 177 11
## 2 2 2 1 2
By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:
set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))
To perform random cross-validation
penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "random")
To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:
penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance
## [1] 0.9949127
Alternatively, the mean squared error (MSE) can be used as loss function, rather than the default R2:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.41.2 nlme_3.1-167 pengls_1.13.0
## [4] rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.4 knitr_1.49 rlang_1.1.5 xfun_0.51
## [5] jsonlite_1.9.0 buildtools_1.0.0 htmltools_0.5.8.1 maketools_1.3.2
## [9] sys_3.4.3 sass_0.4.9 glmnet_4.1-8 grid_4.4.2
## [13] evaluate_1.0.3 jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10
## [17] foreach_1.5.2 lifecycle_1.0.4 compiler_4.4.2 codetools_0.2-20
## [21] Rcpp_1.0.14 lattice_0.22-6 digest_0.6.37 R6_2.6.1
## [25] parallel_4.4.2 splines_4.4.2 shape_1.4.6.1 bslib_0.9.0
## [29] Matrix_1.7-2 tools_4.4.2 iterators_1.0.14 survival_3.8-3
## [33] cachem_1.1.0