Title: | support for parallelized estimation of GLMs/GEEs |
---|---|
Description: | This package provides support for parallelized estimation of GLMs/GEEs, catering for dispersed data. |
Authors: | VJ Carey <[email protected]> |
Maintainer: | VJ Carey <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.39.0 |
Built: | 2025-01-20 05:42:56 UTC |
Source: | https://github.com/bioc/parglms |
This package provides support for parallelized estimation of GLMs/GEEs, catering for dispersed data.
The DESCRIPTION file:
Package: | parglms |
Title: | support for parallelized estimation of GLMs/GEEs |
Version: | 1.39.0 |
Author: | VJ Carey <[email protected]> |
Description: | This package provides support for parallelized estimation of GLMs/GEEs, catering for dispersed data. |
Suggests: | RUnit, sandwich, MASS, knitr, GenomeInfoDb, GenomicRanges, gwascat, BiocStyle, rmarkdown |
VignetteBuilder: | knitr |
Depends: | methods |
Imports: | BiocGenerics, BatchJobs, foreach, doParallel |
Maintainer: | VJ Carey <[email protected]> |
License: | Artistic-2.0 |
LazyLoad: | yes |
BiocViews: | statistics, genetics |
ByteCompile: | TRUE |
Config/pak/sysreqs: | libicu-dev |
Repository: | https://bioc.r-universe.dev |
RemoteUrl: | https://github.com/bioc/parglms |
RemoteRef: | HEAD |
RemoteSha: | 1e7c61f86e9495589f7e1765a11aab2887e38761 |
Index of help topics:
parGLM-methods     fit GLM-like models with parallelized contributions to sufficient statistics
parglms-package    support for parallelized estimation of GLMs/GEEs
In version 0.0.0 we established an approach to fitting GLMs from data that have been persistently dispersed and managed by a Registry.
VJ Carey <[email protected]>
Maintainer: VJ Carey <[email protected]>
This package shares an objective with the bigglm methods of biglm. In bigglm, a small-RAM-footprint algorithm is employed, with sequential chunking to update statistics in each iteration. In parGLM the footprint is likewise controllable, but statistics in each iteration are evaluated in parallel over chunks.
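The chunking idea can be illustrated in base R for the Gaussian case. This is a minimal sketch, not the package's implementation: the chunk layout, the helper name `chunk_stats`, and the use of the built-in `mtcars` data are all assumptions made for illustration. Each chunk contributes only the small matrices X'X and X'y, and the coefficients are recovered from their sums.

```r
# Built-in data: 32 rows split into 4 chunks of 8 rows (illustrative layout)
dat <- mtcars
chunks <- split(dat, rep(1:4, each = 8))

# Hypothetical helper: one chunk's sufficient statistics for mpg ~ wt + hp
chunk_stats <- function(d) {
  X <- model.matrix(mpg ~ wt + hp, data = d)
  list(xtx = crossprod(X), xty = crossprod(X, d$mpg))
}

# lapply runs sequentially here; a parallel map could be swapped in,
# since the chunks do not depend on one another
stats <- lapply(chunks, chunk_stats)

# Combine the low-dimensional summaries and solve the normal equations
xtx <- Reduce(`+`, lapply(stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(stats, `[[`, "xty"))
beta_chunked <- drop(solve(xtx, xty))

# Compare with a fit that sees all rows at once
beta_lm <- coef(lm(mpg ~ wt + hp, data = dat))
print(max(abs(beta_chunked - beta_lm)))
```

Only the 3 x 3 and 3 x 1 summaries travel between workers and the combiner, which is why the memory footprint is governed by the chunk size rather than by the full data size.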
showMethods("parGLM")
This package addresses the problem of fitting GLM-like models in a scalable way, recognizing that data may be dispersed, with chunks processed in parallel, to create low-dimensional summaries from which model fits may be constructed.
signature(formula = "formula", store = "Registry")
The model data are assumed to lie in the file.dir/jobs/* folders, with file.dir defined in the store, which is an instance of Registry.
Additional arguments must be supplied:
family: a function that serves as a family for stats::glm
binit: a vector of initial values for regression parameter estimation; must conform to expectations of formula
maxit: an integer giving the maximum number of iterations allowed
tol: a numeric giving the tolerance criterion
Failure to specify any of these triggers a fatal error.
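To show how arguments named family, binit, maxit and tol could drive an iterative fit over chunks, here is a hedged base-R sketch of iteratively reweighted least squares. The functions `irls_piece` and `chunked_irls`, the simulated data, and the convergence rule (largest absolute coefficient change below tol) are assumptions for illustration; the package's own criterion and internals may differ.

```r
set.seed(1)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + dat$x))
chunks <- split(dat, rep(1:10, each = 20))

# Hypothetical helper: one chunk's contribution to X'WX and X'Wz at beta
irls_piece <- function(d, beta, family) {
  X <- model.matrix(y ~ x, data = d)
  eta <- drop(X %*% beta)
  mu <- family$linkinv(eta)
  dmu <- family$mu.eta(eta)
  w <- dmu^2 / family$variance(mu)   # working weights
  z <- eta + (d$y - mu) / dmu        # working response
  list(xtwx = crossprod(X, w * X), xtwz = crossprod(X, w * z))
}

# Hypothetical driver: sum chunk contributions, update beta, test tolerance
chunked_irls <- function(chunks, family, binit, maxit, tol) {
  if (is.function(family)) family <- family()  # accept gaussian or gaussian()
  beta <- binit
  for (iter in seq_len(maxit)) {
    pieces <- lapply(chunks, irls_piece, beta = beta, family = family)
    xtwx <- Reduce(`+`, lapply(pieces, `[[`, "xtwx"))
    xtwz <- Reduce(`+`, lapply(pieces, `[[`, "xtwz"))
    beta_new <- drop(solve(xtwx, xtwz))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

fit <- chunked_irls(chunks, family = binomial, binit = c(0, 0),
                    maxit = 25, tol = 1e-8)
ref <- coef(glm(y ~ x, data = dat, family = binomial()))
print(max(abs(fit - ref)))
```

The length of binit must match the number of columns in the model matrix implied by the formula, which is why a formula with an intercept and one covariate takes two starting values here.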
The Registry instance can be modified to include a list element 'extractor'. This must be a function with arguments store and i. The standard extraction function is
function(store, i) loadResult(store, i)
It must return a data frame, conformant with the expectations of formula.
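The shape of a custom extractor can be sketched without a real registry. In the example below, `mock_store` is a plain list standing in for a Registry and `mock_load` plays the role of loadResult; both are hypothetical stand-ins, and only the (store, i) signature and the data-frame return value come from the description above.

```r
# Mock stand-in for a Registry: a plain list holding per-chunk data frames.
# A real registry keeps results on disk; this is only illustrative.
mock_store <- list(chunks = split(mtcars, rep(1:4, each = 8)))

# Hypothetical loader playing the role of loadResult(store, i)
mock_load <- function(store, i) store$chunks[[i]]

# Custom extractor: takes (store, i), returns a data frame that carries
# every variable the formula will need, with incomplete rows dropped
mock_store$extractor <- function(store, i) {
  d <- mock_load(store, i)
  d$lwt <- log(d$wt)                   # derived covariate for the formula
  d[complete.cases(d), , drop = FALSE]
}

chunk2 <- mock_store$extractor(mock_store, 2)
print(dim(chunk2))
```

Because limited checking is performed, an extractor of this kind is a natural place to verify column names and types before the chunk reaches the fitting step.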
Limited checking is performed.
The predict method computes the linear predictor on data identified by jobid in a BatchJobs registry. Results are returned as output of foreach over the jobids specified in the predict call.
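A chunk-wise linear predictor can be sketched in base R as follows. The helper `linpred_chunk`, the chunk layout and the use of lapply in place of foreach are assumptions for illustration; the sketch only mirrors the idea of returning one result per requested job id.

```r
chunks <- split(mtcars, rep(1:4, each = 8))
beta <- coef(lm(mpg ~ wt + hp, data = mtcars))

# Hypothetical helper: linear predictor X %*% beta for one chunk
linpred_chunk <- function(d, beta) {
  X <- model.matrix(~ wt + hp, data = d)
  drop(X %*% beta)
}

# One list element per requested chunk, analogous to a list of job results
jobids <- 2:3
eta <- lapply(chunks[jobids], linpred_chunk, beta = beta)
print(lengths(eta))
```

For a non-identity link, the values returned this way are on the link scale; applying the family's inverse link would be needed to obtain fitted means.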
Note that setting option parGLM.showiter to TRUE will provide a message tracing progress of the optimization.
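One way to switch the option on for a single fit and then restore the previous state is shown below. This relies only on base R's options() behaviour; the placement of the fitting call is left as a comment.

```r
# options() returns the previous values, so they can be restored afterwards
old <- options(parGLM.showiter = TRUE)

# ... call the fitting routine here; progress messages would be traced ...
shown <- getOption("parGLM.showiter")

# Restore whatever was set before (including "unset")
options(old)
```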
    if (require(MASS) && require(BatchJobs)) {
      # here is the 'sharding' of a small dataset
      data(anorexia)  # N = 72
      # in .BatchJobs.R:
      # best setting for sharding a small dataset on a small machine:
      # cluster.functions = BatchJobs::makeClusterFunctionsInteractive()
      myr = makeRegistry("abc", file.dir=tempfile())
      chs = chunk(1:nrow(anorexia), n.chunks=18)  # 4 recs/chunk
      f = function(x) {library(MASS); data(anorexia); anorexia[x,]}
      batchMap(myr, f, chs)
      submitJobs(myr)   # now getResult(myr,1) gives back a data.frame
      waitForJobs(myr)  # simple dispersal
      # now myr is populated
      oldopt = options()$parGLM.showiter
      options(parGLM.showiter=TRUE)
      pp = parGLM( Postwt ~ Treat + Prewt, myr,
        family=gaussian, binit = c(0,0,0,0), maxit=10, tol=.001 )
      print(summary(theLM <- lm(Postwt~Treat+Prewt, data=anorexia)))
      print(pp$coefficients - coef(theLM))
      if (require(sandwich)) {
        hc0 <- vcovHC(theLM, type="HC0")
        print(pp$robust.variance - hc0)
      }
      # kept inside the guard: pp, myr and oldopt exist only when the packages load
      predict(pp, store=myr, jobids=2:3)
      options(parGLM.showiter=oldopt)
    }