Introduction

This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and R² and MSE as performance measure are implemented.

Installation instuctions

The pengls package is available from BioConductor, and can be installed as follows:

library(BiocManager)
install("pengls")

Once installed, it can be loaded and version info printed.

suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")

## pengls package version 1.13.0

Illustration

Spatial autocorrelation

We first create a toy dataset with spatial coordinates.

library(nlme)
n <- 25 #Sample size
p <- 50 #Number of features
g <- 15 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
df <- data.frame("a" = a, "b" = b, GridSample)

The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.

# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5))

Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter λ = 0.2.

#Fit the pengls model, for simplicity for a simple lambda
penglsFit <- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, lambda = 0.2, verbose = TRUE)

## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3 
## Iteration 4 
## Iteration 5 
## Iteration 6 
## Iteration 7 
## Iteration 8 
## Iteration 9 
## Iteration 10 
## Iteration 11 
## Iteration 12 
## Iteration 13 
## Iteration 14 
## Iteration 15 
## Iteration 16 
## Iteration 17 
## Iteration 18 
## Iteration 19 
## Iteration 20 
## Iteration 21 
## Iteration 22 
## Iteration 23 
## Iteration 24 
## Iteration 25 
## Iteration 26 
## Iteration 27 
## Iteration 28 
## Iteration 29 
## Iteration 30

## Warning in pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", : No convergence achieved in pengls!

Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.

penglsFit

## pengls model with correlation structure: corGaus 
##  and 14 non-zero coefficients

penglsCoef <- coef(penglsFit)
penglsPred <- predict(penglsFit)

Temporal autocorrelation

The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:

set.seed(354509)
n <- 100 #Sample size
p <- 10 #Number of features
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)

The fitting command is similar, this time the λ parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose α = 0.5 this time, fitting an elastic net model.

penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)

## Fitting naieve model...
## Starting iterations...
## Iteration 1 
## Iteration 2

Show the output

penglsFitTime

## pengls model with correlation structure: corAR1 
##  and 2 non-zero coefficients

Penalty parameter and cross-validation

The pengls package also provides cross-validation for finding the optimal λ value. If the tuning parameter λ is not supplied, the optimal λ according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :

library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithereading

nfolds <- 3 #Number of cross-validation folds

The function is called similarly to cv.glmnet:

penglsFitCV <- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, nfolds = nfolds)

Check the result:

penglsFitCV

## Cross-validated pengls model with correlation structure: corGaus 
##  and 50 non-zero coefficients.
##  3 fold cross-validation yielded an estimated R2 of -2.750974 .

By default, the 1 standard error is used to determine the optimal value of λ :

penglsFitCV$lambda.1se #Lambda for 1 standard error rule

## [1] 0.8071125

penglsFitCV$cvOpt #Corresponding R2

## [1] -2.750974

Extract coefficients and fold IDs.

head(coef(penglsFitCV))

## [1] 696.45930167  -2.49065458   2.91039453   0.15912630  -0.03244352
## [6]  -0.11192722

penglsFitCV$foldid #The folds used

## 129 115  71  75  55  39  33 178   7 134 101 208  44 205  36  62  66 175   2  81 
##   3   3   1   1   2   2   2   3   2   1   1   1   1   1   2   2   2   3   2   2 
## 110  64  22 177  11 
##   2   2   2   1   2

By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:

set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))

To perform random cross-validation

penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "random")

To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:

penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance

## [1] 0.9949127

Alternatively, the mean squared error (MSE) can be used as loss function, rather than the default R²:

penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, loss =  "MSE")

Session info

sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.41.2 nlme_3.1-167        pengls_1.13.0      
## [4] rmarkdown_2.29     
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.4         knitr_1.49        rlang_1.1.5       xfun_0.51        
##  [5] jsonlite_1.9.0    buildtools_1.0.0  htmltools_0.5.8.1 maketools_1.3.2  
##  [9] sys_3.4.3         sass_0.4.9        glmnet_4.1-8      grid_4.4.2       
## [13] evaluate_1.0.3    jquerylib_0.1.4   fastmap_1.2.0     yaml_2.3.10      
## [17] foreach_1.5.2     lifecycle_1.0.4   compiler_4.4.2    codetools_0.2-20 
## [21] Rcpp_1.0.14       lattice_0.22-6    digest_0.6.37     R6_2.6.1         
## [25] parallel_4.4.2    splines_4.4.2     shape_1.4.6.1     bslib_0.9.0      
## [29] Matrix_1.7-2      tools_4.4.2       iterators_1.0.14  survival_3.8-3   
## [33] cachem_1.1.0