DelayedArrays of random values

Introduction

The DelayedRandomArray package implements DelayedArray subclasses containing dynamically sampled random values. Specifically, the actual values are never fully held in memory but are generated when the relevant part of the array is accessed. This allows users to create very large arrays of random values that would not otherwise be possible by filling an ordinary matrix.

To install the package, follow the instructions on DelayedRandomArray landing page. Using the package is then as simple as:

library(DelayedRandomArray)
X <- RandomUnifArray(c(1e6, 1e6))
X
## <1000000 x 1000000> RandomUnifMatrix object of type "double":
##                  [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##       [1,] 0.30192952 0.88189886 0.13789042   . 0.40605853 0.32756703
##       [2,] 0.06198537 0.43109308 0.79817161   . 0.21343589 0.25680141
##       [3,] 0.28092404 0.60931509 0.24974411   . 0.02046445 0.45452674
##       [4,] 0.43079043 0.36502830 0.61924259   . 0.38151761 0.26443651
##       [5,] 0.50665715 0.13282313 0.62808957   . 0.12936370 0.86747424
##        ...          .          .          .   .          .          .
##  [999996,]  0.6059753  0.2812328  0.2717718   .  0.3724968  0.1371885
##  [999997,]  0.9122549  0.8393135  0.8549976   .  0.1990298  0.8783226
##  [999998,]  0.9729627  0.9965181  0.9935182   .  0.3656284  0.5716273
##  [999999,]  0.4187823  0.8620181  0.8629796   .  0.2573691  0.8334974
## [1000000,]  0.6381792  0.2430567  0.4198230   .  0.4852549  0.4704308

The resulting array can be used in any pipeline that is compatible with DelayedArray objects. This object occupies only 64 MB in memory, whereas an ordinary matrix would require 8 PB instead.

Available distributions

Almost every distribution in stats is available here. To list a few:

RandomNormArray(c(100, 50))
## <100 x 50> RandomNormMatrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,] -0.91667165 -0.69230503  0.05242354   .  0.5323146 -1.0043177
##   [2,] -0.42186933 -0.35945254 -1.29500539   . -0.3251134  0.1298343
##   [3,]  0.11032688 -0.44091938  0.27883233   . -0.4697724 -1.9551868
##   [4,]  1.06721755 -1.28149464 -0.66048405   .  1.2157237 -0.7733014
##   [5,]  0.45875250 -0.14050526 -0.20262124   .  0.7450937 -1.0740956
##    ...           .           .           .   .          .          .
##  [96,] -0.57791742 -0.69093484 -0.59001301   .  1.4624963  0.3593126
##  [97,] -1.17704209  0.96222037  0.14480865   . -1.2492447 -1.6800859
##  [98,]  1.22562323 -0.09022698  1.42159070   . -1.1369267  1.2245239
##  [99,] -1.64544696  0.55464486  0.35338749   .  0.6196536 -0.3266343
## [100,]  0.09280476 -0.37874489 -0.01315817   .  1.1694286 -0.7192306
RandomPoisArray(c(100, 50), lambda=5)
## <100 x 50> RandomPoisMatrix object of type "double":
##         [,1]  [,2]  [,3] ... [,49] [,50]
##   [1,]     3     5     8   .     5     3
##   [2,]     3     3     1   .     5     4
##   [3,]     2     5     5   .     4     3
##   [4,]     6     4     7   .     4     7
##   [5,]     2     3     4   .     7     3
##    ...     .     .     .   .     .     .
##  [96,]     6     6     7   .     6     7
##  [97,]     6     5     2   .     5     9
##  [98,]     6     8     4   .     5     7
##  [99,]     4     7     6   .     5     7
## [100,]     1     4     5   .     6     9
RandomGammaArray(c(100, 50), shape=2, rate=5)
## <100 x 50> RandomGammaMatrix object of type "double":
##              [,1]       [,2]       [,3] ...      [,49]      [,50]
##   [1,] 0.56967293 0.11405118 0.69754905   .  0.1025578  0.4612366
##   [2,] 0.22008639 1.18873787 0.28493253   .  0.2765653  0.1370805
##   [3,] 1.69537812 0.58775621 0.06817447   .  0.4871070  0.1872452
##   [4,] 0.24930370 0.20639910 0.28647266   .  0.1762756  0.7404448
##   [5,] 0.23982925 0.08949021 0.18803156   .  0.5277130  0.1221830
##    ...          .          .          .   .          .          .
##  [96,]  0.3684005  0.1495722  0.3082390   . 0.32649142 0.55974516
##  [97,]  0.3841857  0.3990208  0.1424009   . 0.10174332 0.26303415
##  [98,]  0.7133285  0.1081969  0.1650089   . 0.69679479 0.03727454
##  [99,]  0.4390049  0.5432918  0.2384737   . 0.77241241 0.53033346
## [100,]  0.2058354  0.7421198  0.7738918   . 0.05850282 0.25263103
RandomWeibullArray(c(100, 50), shape=5)
## <100 x 50> RandomWeibullMatrix object of type "double":
##             [,1]      [,2]      [,3] ...     [,49]     [,50]
##   [1,] 0.3661683 0.5211680 0.9417071   . 0.8309481 0.9096787
##   [2,] 0.9165740 0.5417126 0.8434901   . 0.6728498 1.0789552
##   [3,] 0.9061442 1.0052422 1.2168514   . 0.8463653 0.7355764
##   [4,] 0.8684932 0.9694717 0.7581332   . 0.9044083 1.1723951
##   [5,] 1.0675663 0.9214778 0.8705093   . 1.0451657 0.8431389
##    ...         .         .         .   .         .         .
##  [96,] 0.9476413 1.2687862 1.0189272   . 0.5272767 1.1024288
##  [97,] 1.0212187 1.1321577 0.6907849   . 0.9914368 1.0287482
##  [98,] 0.4466846 1.0035882 1.0468357   . 0.8659832 1.1468748
##  [99,] 0.8095865 1.0108933 1.1792327   . 0.4677391 0.9308053
## [100,] 0.6467069 0.6532956 0.9068343   . 0.8522492 0.6662050

Distributional parameters can either be scalars:

RandomNormArray(c(100, 50), mean=1)
## <100 x 50> RandomNormMatrix object of type "double":
##               [,1]        [,2]        [,3] ...      [,49]      [,50]
##   [1,]   1.6271004   0.9417343  -0.8299566   .  0.3578880  2.8498174
##   [2,]  -0.5678926   0.3224457   0.1753453   .  1.7869522  1.2763049
##   [3,]   0.4152065  -0.1537246   2.4184335   .  0.1351299  1.0607055
##   [4,]   1.8116126   1.5771291   1.8320494   .  0.2546360  0.4677102
##   [5,]   2.0131716   1.4647709   0.9376699   .  0.3229439 -0.6140739
##    ...           .           .           .   .          .          .
##  [96,]  0.67796728  2.65836630  0.23531014   .  2.0109627  2.2030995
##  [97,]  1.26781258 -1.62733008  2.08079247   .  0.9406402 -0.6655255
##  [98,]  0.90157681 -0.29780861  0.67364775   .  0.6262152  1.1458155
##  [99,]  1.01781372  1.60730906 -0.02888258   .  1.0971746  1.5110251
## [100,]  0.94678510  2.36440205  0.66403493   .  1.5644434  0.7259310

Or vectors, which are recycled along the length of the array:

RandomNormArray(c(100, 50), mean=1:100)
## <100 x 50> RandomNormMatrix object of type "double":
##             [,1]      [,2]      [,3] ...     [,49]     [,50]
##   [1,] 0.7858669 0.9526607 1.6461016   . 0.4983916 2.0423993
##   [2,] 2.1839920 2.3660165 2.6697538   . 0.7944005 2.4216325
##   [3,] 3.6516961 3.0425649 2.5307200   . 3.1758542 4.8473513
##   [4,] 4.1546661 5.1142871 5.0060637   . 3.8286239 4.5313452
##   [5,] 4.4629901 5.1857609 5.1476113   . 5.6448361 4.7955304
##    ...         .         .         .   .         .         .
##  [96,]  94.74345  97.65181  95.94134   .  95.63206  98.13411
##  [97,]  95.88932  98.37645  96.85785   .  98.53561  96.77352
##  [98,]  97.90016  98.29207  97.75492   .  96.35182  97.29571
##  [99,]  97.80403  99.44498  99.26830   .  98.82555  98.86021
## [100,] 100.29459  99.83537 100.72975   .  98.53319 100.25292

Or other arrays of the same dimensions, which are used to sample the corresponding parts of the random array:

means <- RandomNormArray(c(100, 50))
RandomPoisArray(c(100, 50), lambda=2^means)
## <100 x 50> RandomPoisMatrix object of type "double":
##         [,1]  [,2]  [,3] ... [,49] [,50]
##   [1,]     0     2     0   .     2     0
##   [2,]     2     0     1   .     7     1
##   [3,]     0     0     1   .     1     0
##   [4,]     0     0     0   .     2     0
##   [5,]     1     3     0   .     0     0
##    ...     .     .     .   .     .     .
##  [96,]     1     4     1   .     0     3
##  [97,]     1     1     0   .     0     0
##  [98,]     3     0     1   .     2     1
##  [99,]     1     3     2   .     0     2
## [100,]     0     3     0   .     0     1

For example, a hypothetical simulation of a million-cell single-cell RNA-seq dataset might look like this:

ngenes <- 20000
log.abundances <- runif(ngenes, -2, 5)

nclusters <- 20 # define 20 clusters and their population means.
cluster.means <- matrix(2^rnorm(ngenes*nclusters, log.abundances, sd=2), ncol=nclusters)

ncells <- 1e6
clusters <- sample(nclusters, ncells, replace=TRUE) # randomly allocate cells
cell.means <- DelayedArray(cluster.means)[,clusters]

dispersions <- 0.05 + 10/cell.means # typical mean variance trend.

y <- RandomNbinomArray(c(ngenes, ncells), mu=cell.means, size=1/dispersions)
y
## <20000 x 1000000> RandomNbinomMatrix object of type "double":
##                [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##     [1,]          2         78          1   .         80          5
##     [2,]          0          0          0   .          0          0
##     [3,]          1          1          0   .          1          0
##     [4,]          0          3          0   .          4          7
##     [5,]          0          0          2   .          0          0
##      ...          .          .          .   .          .          .
## [19996,]          0          0          0   .          0          0
## [19997,]         44          5          0   .         23          1
## [19998,]          5          2          0   .          7         12
## [19999,]          0         25          0   .          0          0
## [20000,]          0          0          0   .          6          0

Chunking

Each random DelayedArrays is broken into contiguous rectangular chunks of identical size and shape. Each chunk is assigned a seed at construction time that is used to initialize a random number stream (using the PCG32 generator from the dqrng package). When the user accesses any part of the array, we generate the random numbers in the overlapping chunks and return the desired values. This provides efficient random access to any subarray without the need to use any jump-ahead functionality.

The chunking scheme determines the efficiency of accessing our random DelayedArrays. Chunks that are too large require unnecessary number generation when a subarray is requested, while chunks that are too small would increase memory usage and book-keeping overhead. The “best” choice also depends on the downstream access pattern, if such information is known. For example, in a matrix where each column is a chunk, retrieval of a column would be very efficient while retrieval of a single row would be very slow. The default chunk dimensions are set to the square root of the array dimensions (or 100, whichever is larger), providing a reasonable compromise between all of these considerations. This can also be manually specified with the chunkdim= argument.

# Row-wise chunks:
RandomUnifArray(c(1000, 500), chunkdim=c(1, 500))
## <1000 x 500> RandomUnifMatrix object of type "double":
##               [,1]       [,2]       [,3] ...     [,499]     [,500]
##    [1,] 0.75514814 0.09455797 0.67032830   .  0.5220478  0.8447784
##    [2,] 0.18402828 0.14908845 0.16842666   .  0.3590315  0.6041378
##    [3,] 0.88538697 0.98721098 0.80068777   .  0.3340073  0.3737720
##    [4,] 0.86900542 0.13082417 0.13049999   .  0.2851445  0.9526614
##    [5,] 0.83659679 0.24320085 0.65012324   .  0.8444787  0.0126758
##     ...          .          .          .   .          .          .
##  [996,]  0.4537080  0.3846833  0.1623065   . 0.16905022 0.99381300
##  [997,]  0.4002372  0.3405630  0.3046205   . 0.68369134 0.18999845
##  [998,]  0.8619803  0.2892201  0.2051199   . 0.52791046 0.28215433
##  [999,]  0.1295508  0.9675237  0.7448341   . 0.24583872 0.77685991
## [1000,]  0.4823558  0.2244517  0.4766629   . 0.09814071 0.77033732
# Column-wise chunks:
RandomUnifArray(c(1000, 500), chunkdim=c(1000, 1))
## <1000 x 500> RandomUnifMatrix object of type "double":
##               [,1]       [,2]       [,3] ...      [,499]      [,500]
##    [1,]  0.3975883  0.3099393  0.5339272   . 0.294558755 0.533025398
##    [2,]  0.3412318  0.3867555  0.2244418   . 0.237083409 0.778487574
##    [3,]  0.4834278  0.9753835  0.8250138   . 0.214358950 0.082422410
##    [4,]  0.8813596  0.1279683  0.4750817   . 0.962084805 0.548938826
##    [5,]  0.1925515  0.1801415  0.2139640   . 0.887157912 0.002877622
##     ...          .          .          .   .           .           .
##  [996,] 0.47078327 0.46148727 0.91205026   .   0.5999716   0.9957839
##  [997,] 0.71811211 0.48143623 0.89618131   .   0.8833691   0.8853602
##  [998,] 0.76673861 0.01677927 0.71901624   .   0.8982029   0.7241702
##  [999,] 0.86805271 0.56156021 0.99636899   .   0.8826324   0.4823754
## [1000,] 0.63883150 0.38561844 0.04546727   .   0.1101447   0.4138880

Unlike other chunk-based DelayedArrays, the actual values of the random DelayedArray are dependent on the chunk parameters. This is because the sampling is done within each chunk and any alteration to the chunk shape or size will rearrange the stream of random numbers within the array. Thus, even when the seed is set, a different chunkdim will yield different results:

set.seed(199)
RandomUnifArray(c(10, 5), chunkdim=c(1, 5))
## <10 x 5> RandomUnifMatrix object of type "double":
##               [,1]         [,2]         [,3]         [,4]         [,5]
##  [1,] 0.4432813900 0.7119991907 0.7257159729 0.9557158216 0.2104000037
##  [2,] 0.3134128384 0.0670862596 0.3807734565 0.1526811279 0.1971345709
##  [3,] 0.5525409468 0.7807079460 0.9905007351 0.0898367104 0.8541555854
##  [4,] 0.8914577304 0.3764661429 0.6028462166 0.2576166289 0.0145520803
##  [5,] 0.5042253651 0.2618340398 0.2405915416 0.6800763716 0.5603335327
##  [6,] 0.3953960796 0.1314800435 0.9008538304 0.1165704445 0.9035755624
##  [7,] 0.6270407308 0.0001234491 0.3542078000 0.3546805161 0.9760325255
##  [8,] 0.2882098067 0.9133890264 0.8018615395 0.7615402588 0.6458653482
##  [9,] 0.2233627134 0.2957882064 0.7371763836 0.3001928469 0.8730950402
## [10,] 0.6481108814 0.5491673953 0.6821353873 0.4434014931 0.1461723484
set.seed(199)
RandomUnifArray(c(10, 5), chunkdim=c(10, 1))
## <10 x 5> RandomUnifMatrix object of type "double":
##             [,1]       [,2]       [,3]       [,4]       [,5]
##  [1,] 0.44328139 0.31341284 0.55254095 0.89145773 0.50422537
##  [2,] 0.71199919 0.06708626 0.78070795 0.37646614 0.26183404
##  [3,] 0.72571597 0.38077346 0.99050074 0.60284622 0.24059154
##  [4,] 0.95571582 0.15268113 0.08983671 0.25761663 0.68007637
##  [5,] 0.21040000 0.19713457 0.85415559 0.01455208 0.56033353
##  [6,] 0.63126383 0.10118984 0.76830283 0.17583353 0.62183470
##  [7,] 0.87478047 0.26076972 0.58018989 0.92146566 0.90680388
##  [8,] 0.55180537 0.20464839 0.38008429 0.56606812 0.75101891
##  [9,] 0.40323090 0.20083487 0.28559824 0.67769322 0.11943279
## [10,] 0.13404760 0.94959186 0.61432837 0.67675354 0.64178865

Further comments

Like any other random process, the seed should be set to achieve reproducible results. We stress that the R-level seed only needs to be set before construction of the random DelayedArray; it is not necessary to set the seed during its use. This is because the class itself will define further seeds (one per chunk) and store these in the object for use in per-chunk sampling.

set.seed(999)
RandomNormArray(c(10, 5))
## <10 x 5> RandomNormMatrix object of type "double":
##              [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,]  0.07299897 -0.62967402  0.48686043 -1.00406271  0.09839211
##  [2,] -1.87549362  0.11653501  0.98496315 -0.85515988 -1.44932474
##  [3,]  0.62150210 -0.24704984 -0.69766033 -1.60089511  0.94752723
##  [4,] -1.68035676  0.20939894 -3.29663662  0.57270346  0.94413669
##  [5,]  0.22933664  1.12333362 -0.77725398  1.59886411  1.07042538
##  [6,]  0.43082191 -0.22740387  1.25381398  0.32842239  0.20448699
##  [7,] -0.71424097  0.27773134  1.53748664  0.01761569  1.15470498
##  [8,]  0.35908991  0.35618136 -0.49375892 -0.38652760  0.95591577
##  [9,]  0.06472627  0.79333115 -0.73803086 -0.56434493  0.22473943
## [10,] -2.41389122  0.76273286  0.47567035 -0.08524682 -1.17849232
set.seed(999)
RandomNormArray(c(10, 5))
## <10 x 5> RandomNormMatrix object of type "double":
##              [,1]        [,2]        [,3]        [,4]        [,5]
##  [1,]  0.07299897 -0.62967402  0.48686043 -1.00406271  0.09839211
##  [2,] -1.87549362  0.11653501  0.98496315 -0.85515988 -1.44932474
##  [3,]  0.62150210 -0.24704984 -0.69766033 -1.60089511  0.94752723
##  [4,] -1.68035676  0.20939894 -3.29663662  0.57270346  0.94413669
##  [5,]  0.22933664  1.12333362 -0.77725398  1.59886411  1.07042538
##  [6,]  0.43082191 -0.22740387  1.25381398  0.32842239  0.20448699
##  [7,] -0.71424097  0.27773134  1.53748664  0.01761569  1.15470498
##  [8,]  0.35908991  0.35618136 -0.49375892 -0.38652760  0.95591577
##  [9,]  0.06472627  0.79333115 -0.73803086 -0.56434493  0.22473943
## [10,] -2.41389122  0.76273286  0.47567035 -0.08524682 -1.17849232

For certain distributions, it is possible to indicate that the array is sparse. This does not change the result or efficiency of the sampling process, but can still be useful as it allows downstream functions to use more efficient sparse algorithms. Of course, this is only relevant if the distributional parameters are such that sparsity is actually observed.

RandomPoisArray(c(1e6, 1e6), lambda=0.5) # dense by default
## <1000000 x 1000000> RandomPoisMatrix object of type "double":
##                  [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##       [1,]          0          0          1   .          1          0
##       [2,]          0          1          1   .          0          1
##       [3,]          0          1          1   .          0          0
##       [4,]          0          1          0   .          1          0
##       [5,]          0          0          0   .          1          0
##        ...          .          .          .   .          .          .
##  [999996,]          1          0          0   .          1          1
##  [999997,]          0          1          0   .          1          1
##  [999998,]          2          0          1   .          0          0
##  [999999,]          1          0          1   .          0          1
## [1000000,]          0          0          0   .          1          0
RandomPoisArray(c(1e6, 1e6), lambda=0.5, sparse=TRUE) # treat as sparse
## <1000000 x 1000000> sparse RandomPoisMatrix object of type "double":
##                  [,1]       [,2]       [,3] ...  [,999999] [,1000000]
##       [1,]          0          1          1   .          1          0
##       [2,]          0          1          1   .          1          0
##       [3,]          0          1          1   .          0          0
##       [4,]          1          2          0   .          0          0
##       [5,]          0          1          0   .          0          2
##        ...          .          .          .   .          .          .
##  [999996,]          0          0          1   .          0          0
##  [999997,]          0          0          0   .          1          0
##  [999998,]          3          0          0   .          1          1
##  [999999,]          1          0          0   .          1          0
## [1000000,]          0          2          1   .          1          0

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DelayedRandomArray_1.15.0 DelayedArray_0.31.14     
##  [3] SparseArray_1.5.45        S4Arrays_1.5.11          
##  [5] IRanges_2.39.2            abind_1.4-8              
##  [7] S4Vectors_0.43.2          MatrixGenerics_1.17.1    
##  [9] matrixStats_1.4.1         BiocGenerics_0.53.0      
## [11] Matrix_1.7-1              BiocStyle_2.35.0         
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.3        cli_3.6.3           knitr_1.48         
##  [4] rlang_1.1.4         xfun_0.48           jsonlite_1.8.9     
##  [7] dqrng_0.4.1         buildtools_1.0.0    htmltools_0.5.8.1  
## [10] maketools_1.3.1     sys_3.4.3           sass_0.4.9         
## [13] rmarkdown_2.28      grid_4.4.1          evaluate_1.0.1     
## [16] jquerylib_0.1.4     fastmap_1.2.0       yaml_2.3.10        
## [19] lifecycle_1.0.4     BiocManager_1.30.25 compiler_4.4.1     
## [22] Rcpp_1.0.13         XVector_0.45.0      lattice_0.22-6     
## [25] digest_0.6.37       R6_2.5.1            bslib_0.8.0        
## [28] tools_4.4.1         zlibbioc_1.51.2     cachem_1.1.0