Vignette of the pengls package

Introduction

This vignette demonstrates the use of the pengls package for high-dimensional data with spatial or temporal autocorrelation. It consists of an iterative loop around the nlme and glmnet packages. Currently, only continuous outcomes and R2 and MSE as performance measure are implemented.

Installation instuctions

The pengls package is available from BioConductor, and can be installed as follows:

library(BiocManager)
install("pengls")

Once installed, it can be loaded and version info printed.

suppressPackageStartupMessages(library(pengls))
cat("pengls package version", as.character(packageVersion("pengls")), "\n")
## pengls package version 1.11.0

Illustration

Spatial autocorrelation

We first create a toy dataset with spatial coordinates.

library(nlme)
n <- 25 #Sample size
p <- 50 #Number of features
g <- 15 #Size of the grid
#Generate grid
Grid <- expand.grid("x" = seq_len(g), "y" = seq_len(g))
# Sample points from grid without replacement
GridSample <- Grid[sample(nrow(Grid), n, replace = FALSE),]
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
df <- data.frame("a" = a, "b" = b, GridSample)

The pengls method requires prespecification of a functional form for the autocorrelation. This is done through the corStruct objects defined by the nlme package. We specify a correlation decaying as a Gaussian curve with distance, and with a nugget parameter. The nugget parameter is a proportion that indicates how much of the correlation structure explained by independent errors; the rest is attributed to spatial autocorrelation. The starting values are chosen as reasonable guesses; they will be overwritten in the fitting process.

# Define the correlation structure (see ?nlme::gls), with initial nugget 0.5 and range 5
corStruct <- corGaus(form = ~ x + y, nugget = TRUE, value = c("range" = 5, "nugget" = 0.5))

Finally the model is fitted with a single outcome variable and large number of regressors, with the chosen covariance structure and for a prespecified penalty parameter λ = 0.2.

#Fit the pengls model, for simplicity for a simple lambda
penglsFit <- pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, lambda = 0.2, verbose = TRUE)
## Starting iterations...
## Iteration 1 
## Iteration 2 
## Iteration 3

Standard extraction functions like print(), coef() and predict() are defined for the new “pengls” object.

penglsFit
## pengls model with correlation structure: corGaus 
##  and 17 non-zero coefficients
penglsCoef <- coef(penglsFit)
penglsPred <- predict(penglsFit)

Temporal autocorrelation

The method can also account for temporal autocorrelation by defining another correlation structure from the nlme package, e.g. autocorrelation structure of order 1:

set.seed(354509)
n <- 100 #Sample size
p <- 10 #Number of features
#Generate outcome and regressors
b <- matrix(rnorm(p*n), n , p)
a <- rnorm(n, mean = b %*% rbinom(p, size = 1, p = 0.25), sd = 0.1) #25% signal
#Compile to a matrix
dfTime <- data.frame("a" = a, "b" = b, "t" = seq_len(n))
corStructTime <- corAR1(form = ~ t, value = 0.5)

The fitting command is similar, this time the λ parameter is found through cross-validation of the naive glmnet (for full cross-validation , see below). We choose α = 0.5 this time, fitting an elastic net model.

penglsFitTime <- pengls(data = dfTime, outVar = "a", verbose = TRUE,
xNames = grep(names(dfTime), pattern = "b", value =TRUE),
glsSt = corStructTime, nfolds = 5, alpha = 0.5)
## Fitting naieve model...
## Starting iterations...
## Iteration 1 
## Iteration 2

Show the output

penglsFitTime
## pengls model with correlation structure: corAR1 
##  and 2 non-zero coefficients

Penalty parameter and cross-validation

The pengls package also provides cross-validation for finding the optimal λ value. If the tuning parameter λ is not supplied, the optimal λ according to cross-validation with the naive glmnet function (the one that ignores dependence) is used. Hence we recommend to use the following function to use cross-validation. Multithreading is supported through the BiocParallel package :

library(BiocParallel)
register(MulticoreParam(3)) #Prepare multithereading
nfolds <- 3 #Number of cross-validation folds

The function is called similarly to cv.glmnet:

penglsFitCV <- cv.pengls(data = df, outVar = "a", xNames = grep(names(df), pattern = "b", value =TRUE), glsSt = corStruct, nfolds = nfolds)

Check the result:

penglsFitCV
## Cross-validated pengls model with correlation structure: corGaus 
##  and 14 non-zero coefficients.
##  3 fold cross-validation yielded an estimated R2 of -0.8964993 .

By default, the 1 standard error is used to determine the optimal value of λ :

penglsFitCV$lambda.1se #Lambda for 1 standard error rule
## [1] 1.503003
penglsFitCV$cvOpt #Corresponding R2
## [1] -0.8964993

Extract coefficients and fold IDs.

head(coef(penglsFitCV))
## [1] 0.09914069 0.72290146 0.00000000 0.00000000 0.16328403 0.00000000
penglsFitCV$foldid #The folds used
## 222 128  66 135 116 164 172  88 150 155 168 153 183  95 106  25  81 154  57 100 
##   1   3   3   2   2   2   1   2   2   3   1   3   1   3   3   2   3   1   2   2 
## 107 167 109  30 174 
##   3   1   1   2   1

By default, blocked cross-validation is used, but random cross-validation is also available (but not recommended for timecourse or spatial data). First we illustrate the different ways graphically, again using the timecourse example:

set.seed(5657)
randomFolds <- makeFolds(nfolds = nfolds, dfTime, "random", "t")
blockedFolds <- makeFolds(nfolds = nfolds, dfTime, "blocked", "t")
plot(dfTime$t, randomFolds, xlab ="Time", ylab ="Fold")
points(dfTime$t, blockedFolds, col = "red")
legend("topleft", legend = c("random", "blocked"), pch = 1, col = c("black", "red"))

To perform random cross-validation

penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "random")

To negate baseline differences at different timepoints, it may be useful to center or scale the outcomes in the cross validation. For instance for centering only:

penglsFitCVtimeCenter <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, cvType = "blocked", transFun = function(x) x-mean(x))
penglsFitCVtimeCenter$cvOpt #Better performance
## [1] 0.9949127

Alternatively, the mean squared error (MSE) can be used as loss function, rather than the default R2:

penglsFitCVtime <- cv.pengls(data = dfTime, outVar = "a", xNames = grep(names(dfTime), pattern = "b", value =TRUE), glsSt = corStructTime, nfolds = nfolds, loss =  "MSE")

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.39.0 nlme_3.1-166        pengls_1.11.0      
## [4] rmarkdown_2.28     
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.3         knitr_1.48        rlang_1.1.4       xfun_0.47        
##  [5] highr_0.11        jsonlite_1.8.8    buildtools_1.0.0  htmltools_0.5.8.1
##  [9] maketools_1.3.0   sys_3.4.2         sass_0.4.9        glmnet_4.1-8     
## [13] grid_4.4.1        evaluate_0.24.0   jquerylib_0.1.4   fastmap_1.2.0    
## [17] yaml_2.3.10       foreach_1.5.2     lifecycle_1.0.4   compiler_4.4.1   
## [21] codetools_0.2-20  Rcpp_1.0.13       lattice_0.22-6    digest_0.6.37    
## [25] R6_2.5.1          parallel_4.4.1    splines_4.4.1     shape_1.4.6.1    
## [29] bslib_0.8.0       Matrix_1.7-0      tools_4.4.1       iterators_1.0.14 
## [33] survival_3.7-0    cachem_1.1.0