The DelayedRandomArray
package implements DelayedArray
subclasses containing
dynamically sampled random values. Specifically, the actual values are
never fully held in memory but are generated when the relevant part of
the array is accessed. This allows users to create very large arrays of
random values that would not otherwise be possible by filling an
ordinary matrix.
To install the package, follow the instructions on DelayedRandomArray landing page. Using the package is then as simple as:
## <1000000 x 1000000> RandomUnifMatrix object of type "double":
## [,1] [,2] [,3] ... [,999999] [,1000000]
## [1,] 0.30192952 0.88189886 0.13789042 . 0.40605853 0.32756703
## [2,] 0.06198537 0.43109308 0.79817161 . 0.21343589 0.25680141
## [3,] 0.28092404 0.60931509 0.24974411 . 0.02046445 0.45452674
## [4,] 0.43079043 0.36502830 0.61924259 . 0.38151761 0.26443651
## [5,] 0.50665715 0.13282313 0.62808957 . 0.12936370 0.86747424
## ... . . . . . .
## [999996,] 0.6059753 0.2812328 0.2717718 . 0.3724968 0.1371885
## [999997,] 0.9122549 0.8393135 0.8549976 . 0.1990298 0.8783226
## [999998,] 0.9729627 0.9965181 0.9935182 . 0.3656284 0.5716273
## [999999,] 0.4187823 0.8620181 0.8629796 . 0.2573691 0.8334974
## [1000000,] 0.6381792 0.2430567 0.4198230 . 0.4852549 0.4704308
The resulting array can be used in any pipeline that is compatible
with DelayedArray
objects. This object occupies only 64 MB
in memory, whereas an ordinary matrix
would require 8 PB
instead.
Almost every distribution in stats is available here. To list a few:
## <100 x 50> RandomNormMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] -0.91667165 -0.69230503 0.05242354 . 0.5323146 -1.0043177
## [2,] -0.42186933 -0.35945254 -1.29500539 . -0.3251134 0.1298343
## [3,] 0.11032688 -0.44091938 0.27883233 . -0.4697724 -1.9551868
## [4,] 1.06721755 -1.28149464 -0.66048405 . 1.2157237 -0.7733014
## [5,] 0.45875250 -0.14050526 -0.20262124 . 0.7450937 -1.0740956
## ... . . . . . .
## [96,] -0.57791742 -0.69093484 -0.59001301 . 1.4624963 0.3593126
## [97,] -1.17704209 0.96222037 0.14480865 . -1.2492447 -1.6800859
## [98,] 1.22562323 -0.09022698 1.42159070 . -1.1369267 1.2245239
## [99,] -1.64544696 0.55464486 0.35338749 . 0.6196536 -0.3266343
## [100,] 0.09280476 -0.37874489 -0.01315817 . 1.1694286 -0.7192306
## <100 x 50> RandomPoisMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 3 5 8 . 5 3
## [2,] 3 3 1 . 5 4
## [3,] 2 5 5 . 4 3
## [4,] 6 4 7 . 4 7
## [5,] 2 3 4 . 7 3
## ... . . . . . .
## [96,] 6 6 7 . 6 7
## [97,] 6 5 2 . 5 9
## [98,] 6 8 4 . 5 7
## [99,] 4 7 6 . 5 7
## [100,] 1 4 5 . 6 9
## <100 x 50> RandomGammaMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.56967293 0.11405118 0.69754905 . 0.1025578 0.4612366
## [2,] 0.22008639 1.18873787 0.28493253 . 0.2765653 0.1370805
## [3,] 1.69537812 0.58775621 0.06817447 . 0.4871070 0.1872452
## [4,] 0.24930370 0.20639910 0.28647266 . 0.1762756 0.7404448
## [5,] 0.23982925 0.08949021 0.18803156 . 0.5277130 0.1221830
## ... . . . . . .
## [96,] 0.3684005 0.1495722 0.3082390 . 0.32649142 0.55974516
## [97,] 0.3841857 0.3990208 0.1424009 . 0.10174332 0.26303415
## [98,] 0.7133285 0.1081969 0.1650089 . 0.69679479 0.03727454
## [99,] 0.4390049 0.5432918 0.2384737 . 0.77241241 0.53033346
## [100,] 0.2058354 0.7421198 0.7738918 . 0.05850282 0.25263103
## <100 x 50> RandomWeibullMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.3661683 0.5211680 0.9417071 . 0.8309481 0.9096787
## [2,] 0.9165740 0.5417126 0.8434901 . 0.6728498 1.0789552
## [3,] 0.9061442 1.0052422 1.2168514 . 0.8463653 0.7355764
## [4,] 0.8684932 0.9694717 0.7581332 . 0.9044083 1.1723951
## [5,] 1.0675663 0.9214778 0.8705093 . 1.0451657 0.8431389
## ... . . . . . .
## [96,] 0.9476413 1.2687862 1.0189272 . 0.5272767 1.1024288
## [97,] 1.0212187 1.1321577 0.6907849 . 0.9914368 1.0287482
## [98,] 0.4466846 1.0035882 1.0468357 . 0.8659832 1.1468748
## [99,] 0.8095865 1.0108933 1.1792327 . 0.4677391 0.9308053
## [100,] 0.6467069 0.6532956 0.9068343 . 0.8522492 0.6662050
Distributional parameters can either be scalars:
## <100 x 50> RandomNormMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 1.6271004 0.9417343 -0.8299566 . 0.3578880 2.8498174
## [2,] -0.5678926 0.3224457 0.1753453 . 1.7869522 1.2763049
## [3,] 0.4152065 -0.1537246 2.4184335 . 0.1351299 1.0607055
## [4,] 1.8116126 1.5771291 1.8320494 . 0.2546360 0.4677102
## [5,] 2.0131716 1.4647709 0.9376699 . 0.3229439 -0.6140739
## ... . . . . . .
## [96,] 0.67796728 2.65836630 0.23531014 . 2.0109627 2.2030995
## [97,] 1.26781258 -1.62733008 2.08079247 . 0.9406402 -0.6655255
## [98,] 0.90157681 -0.29780861 0.67364775 . 0.6262152 1.1458155
## [99,] 1.01781372 1.60730906 -0.02888258 . 1.0971746 1.5110251
## [100,] 0.94678510 2.36440205 0.66403493 . 1.5644434 0.7259310
Or vectors, which are recycled along the length of the array:
## <100 x 50> RandomNormMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.7858669 0.9526607 1.6461016 . 0.4983916 2.0423993
## [2,] 2.1839920 2.3660165 2.6697538 . 0.7944005 2.4216325
## [3,] 3.6516961 3.0425649 2.5307200 . 3.1758542 4.8473513
## [4,] 4.1546661 5.1142871 5.0060637 . 3.8286239 4.5313452
## [5,] 4.4629901 5.1857609 5.1476113 . 5.6448361 4.7955304
## ... . . . . . .
## [96,] 94.74345 97.65181 95.94134 . 95.63206 98.13411
## [97,] 95.88932 98.37645 96.85785 . 98.53561 96.77352
## [98,] 97.90016 98.29207 97.75492 . 96.35182 97.29571
## [99,] 97.80403 99.44498 99.26830 . 98.82555 98.86021
## [100,] 100.29459 99.83537 100.72975 . 98.53319 100.25292
Or other arrays of the same dimensions, which are used to sample the corresponding parts of the random array:
## <100 x 50> RandomPoisMatrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0 2 0 . 2 0
## [2,] 2 0 1 . 7 1
## [3,] 0 0 1 . 1 0
## [4,] 0 0 0 . 2 0
## [5,] 1 3 0 . 0 0
## ... . . . . . .
## [96,] 1 4 1 . 0 3
## [97,] 1 1 0 . 0 0
## [98,] 3 0 1 . 2 1
## [99,] 1 3 2 . 0 2
## [100,] 0 3 0 . 0 1
For example, a hypothetical simulation of a million-cell single-cell RNA-seq dataset might look like this:
ngenes <- 20000
log.abundances <- runif(ngenes, -2, 5)
nclusters <- 20 # define 20 clusters and their population means.
cluster.means <- matrix(2^rnorm(ngenes*nclusters, log.abundances, sd=2), ncol=nclusters)
ncells <- 1e6
clusters <- sample(nclusters, ncells, replace=TRUE) # randomly allocate cells
cell.means <- DelayedArray(cluster.means)[,clusters]
dispersions <- 0.05 + 10/cell.means # typical mean variance trend.
y <- RandomNbinomArray(c(ngenes, ncells), mu=cell.means, size=1/dispersions)
y
## <20000 x 1000000> RandomNbinomMatrix object of type "double":
## [,1] [,2] [,3] ... [,999999] [,1000000]
## [1,] 2 78 1 . 80 5
## [2,] 0 0 0 . 0 0
## [3,] 1 1 0 . 1 0
## [4,] 0 3 0 . 4 7
## [5,] 0 0 2 . 0 0
## ... . . . . . .
## [19996,] 0 0 0 . 0 0
## [19997,] 44 5 0 . 23 1
## [19998,] 5 2 0 . 7 12
## [19999,] 0 25 0 . 0 0
## [20000,] 0 0 0 . 6 0
Each random DelayedArray
s is broken into contiguous
rectangular chunks of identical size and shape. Each chunk is assigned a
seed at construction time that is used to initialize a random number
stream (using the PCG32 generator from the dqrng package).
When the user accesses any part of the array, we generate the random
numbers in the overlapping chunks and return the desired values. This
provides efficient random access to any subarray without the need to use
any jump-ahead functionality.
The chunking scheme determines the efficiency of accessing our random
DelayedArray
s. Chunks that are too large require
unnecessary number generation when a subarray is requested, while chunks
that are too small would increase memory usage and book-keeping
overhead. The “best” choice also depends on the downstream access
pattern, if such information is known. For example, in a matrix where
each column is a chunk, retrieval of a column would be very efficient
while retrieval of a single row would be very slow. The default chunk
dimensions are set to the square root of the array dimensions (or 100,
whichever is larger), providing a reasonable compromise between all of
these considerations. This can also be manually specified with the
chunkdim=
argument.
## <1000 x 500> RandomUnifMatrix object of type "double":
## [,1] [,2] [,3] ... [,499] [,500]
## [1,] 0.75514814 0.09455797 0.67032830 . 0.5220478 0.8447784
## [2,] 0.18402828 0.14908845 0.16842666 . 0.3590315 0.6041378
## [3,] 0.88538697 0.98721098 0.80068777 . 0.3340073 0.3737720
## [4,] 0.86900542 0.13082417 0.13049999 . 0.2851445 0.9526614
## [5,] 0.83659679 0.24320085 0.65012324 . 0.8444787 0.0126758
## ... . . . . . .
## [996,] 0.4537080 0.3846833 0.1623065 . 0.16905022 0.99381300
## [997,] 0.4002372 0.3405630 0.3046205 . 0.68369134 0.18999845
## [998,] 0.8619803 0.2892201 0.2051199 . 0.52791046 0.28215433
## [999,] 0.1295508 0.9675237 0.7448341 . 0.24583872 0.77685991
## [1000,] 0.4823558 0.2244517 0.4766629 . 0.09814071 0.77033732
## <1000 x 500> RandomUnifMatrix object of type "double":
## [,1] [,2] [,3] ... [,499] [,500]
## [1,] 0.3975883 0.3099393 0.5339272 . 0.294558755 0.533025398
## [2,] 0.3412318 0.3867555 0.2244418 . 0.237083409 0.778487574
## [3,] 0.4834278 0.9753835 0.8250138 . 0.214358950 0.082422410
## [4,] 0.8813596 0.1279683 0.4750817 . 0.962084805 0.548938826
## [5,] 0.1925515 0.1801415 0.2139640 . 0.887157912 0.002877622
## ... . . . . . .
## [996,] 0.47078327 0.46148727 0.91205026 . 0.5999716 0.9957839
## [997,] 0.71811211 0.48143623 0.89618131 . 0.8833691 0.8853602
## [998,] 0.76673861 0.01677927 0.71901624 . 0.8982029 0.7241702
## [999,] 0.86805271 0.56156021 0.99636899 . 0.8826324 0.4823754
## [1000,] 0.63883150 0.38561844 0.04546727 . 0.1101447 0.4138880
Unlike other chunk-based DelayedArray
s, the actual
values of the random DelayedArray
are dependent on the
chunk parameters. This is because the sampling is done within each chunk
and any alteration to the chunk shape or size will rearrange the stream
of random numbers within the array. Thus, even when the seed is set, a
different chunkdim
will yield different results:
## <10 x 5> RandomUnifMatrix object of type "double":
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4432813900 0.7119991907 0.7257159729 0.9557158216 0.2104000037
## [2,] 0.3134128384 0.0670862596 0.3807734565 0.1526811279 0.1971345709
## [3,] 0.5525409468 0.7807079460 0.9905007351 0.0898367104 0.8541555854
## [4,] 0.8914577304 0.3764661429 0.6028462166 0.2576166289 0.0145520803
## [5,] 0.5042253651 0.2618340398 0.2405915416 0.6800763716 0.5603335327
## [6,] 0.3953960796 0.1314800435 0.9008538304 0.1165704445 0.9035755624
## [7,] 0.6270407308 0.0001234491 0.3542078000 0.3546805161 0.9760325255
## [8,] 0.2882098067 0.9133890264 0.8018615395 0.7615402588 0.6458653482
## [9,] 0.2233627134 0.2957882064 0.7371763836 0.3001928469 0.8730950402
## [10,] 0.6481108814 0.5491673953 0.6821353873 0.4434014931 0.1461723484
## <10 x 5> RandomUnifMatrix object of type "double":
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.44328139 0.31341284 0.55254095 0.89145773 0.50422537
## [2,] 0.71199919 0.06708626 0.78070795 0.37646614 0.26183404
## [3,] 0.72571597 0.38077346 0.99050074 0.60284622 0.24059154
## [4,] 0.95571582 0.15268113 0.08983671 0.25761663 0.68007637
## [5,] 0.21040000 0.19713457 0.85415559 0.01455208 0.56033353
## [6,] 0.63126383 0.10118984 0.76830283 0.17583353 0.62183470
## [7,] 0.87478047 0.26076972 0.58018989 0.92146566 0.90680388
## [8,] 0.55180537 0.20464839 0.38008429 0.56606812 0.75101891
## [9,] 0.40323090 0.20083487 0.28559824 0.67769322 0.11943279
## [10,] 0.13404760 0.94959186 0.61432837 0.67675354 0.64178865
Like any other random process, the seed should be set to achieve
reproducible results. We stress that the R-level seed only needs to be
set before construction of the random
DelayedArray
; it is not necessary to set the seed during
its use. This is because the class itself will define further
seeds (one per chunk) and store these in the object for use in per-chunk
sampling.
## <10 x 5> RandomNormMatrix object of type "double":
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.07299897 -0.62967402 0.48686043 -1.00406271 0.09839211
## [2,] -1.87549362 0.11653501 0.98496315 -0.85515988 -1.44932474
## [3,] 0.62150210 -0.24704984 -0.69766033 -1.60089511 0.94752723
## [4,] -1.68035676 0.20939894 -3.29663662 0.57270346 0.94413669
## [5,] 0.22933664 1.12333362 -0.77725398 1.59886411 1.07042538
## [6,] 0.43082191 -0.22740387 1.25381398 0.32842239 0.20448699
## [7,] -0.71424097 0.27773134 1.53748664 0.01761569 1.15470498
## [8,] 0.35908991 0.35618136 -0.49375892 -0.38652760 0.95591577
## [9,] 0.06472627 0.79333115 -0.73803086 -0.56434493 0.22473943
## [10,] -2.41389122 0.76273286 0.47567035 -0.08524682 -1.17849232
## <10 x 5> RandomNormMatrix object of type "double":
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.07299897 -0.62967402 0.48686043 -1.00406271 0.09839211
## [2,] -1.87549362 0.11653501 0.98496315 -0.85515988 -1.44932474
## [3,] 0.62150210 -0.24704984 -0.69766033 -1.60089511 0.94752723
## [4,] -1.68035676 0.20939894 -3.29663662 0.57270346 0.94413669
## [5,] 0.22933664 1.12333362 -0.77725398 1.59886411 1.07042538
## [6,] 0.43082191 -0.22740387 1.25381398 0.32842239 0.20448699
## [7,] -0.71424097 0.27773134 1.53748664 0.01761569 1.15470498
## [8,] 0.35908991 0.35618136 -0.49375892 -0.38652760 0.95591577
## [9,] 0.06472627 0.79333115 -0.73803086 -0.56434493 0.22473943
## [10,] -2.41389122 0.76273286 0.47567035 -0.08524682 -1.17849232
For certain distributions, it is possible to indicate that the array is sparse. This does not change the result or efficiency of the sampling process, but can still be useful as it allows downstream functions to use more efficient sparse algorithms. Of course, this is only relevant if the distributional parameters are such that sparsity is actually observed.
## <1000000 x 1000000> RandomPoisMatrix object of type "double":
## [,1] [,2] [,3] ... [,999999] [,1000000]
## [1,] 0 0 1 . 1 0
## [2,] 0 1 1 . 0 1
## [3,] 0 1 1 . 0 0
## [4,] 0 1 0 . 1 0
## [5,] 0 0 0 . 1 0
## ... . . . . . .
## [999996,] 1 0 0 . 1 1
## [999997,] 0 1 0 . 1 1
## [999998,] 2 0 1 . 0 0
## [999999,] 1 0 1 . 0 1
## [1000000,] 0 0 0 . 1 0
## <1000000 x 1000000> sparse RandomPoisMatrix object of type "double":
## [,1] [,2] [,3] ... [,999999] [,1000000]
## [1,] 0 1 1 . 1 0
## [2,] 0 1 1 . 1 0
## [3,] 0 1 1 . 0 0
## [4,] 1 2 0 . 0 0
## [5,] 0 1 0 . 0 2
## ... . . . . . .
## [999996,] 0 0 1 . 0 0
## [999997,] 0 0 0 . 1 0
## [999998,] 3 0 0 . 1 1
## [999999,] 1 0 0 . 1 0
## [1000000,] 0 2 1 . 1 0
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DelayedRandomArray_1.15.0 DelayedArray_0.31.14
## [3] SparseArray_1.5.45 S4Arrays_1.5.11
## [5] IRanges_2.39.2 abind_1.4-8
## [7] S4Vectors_0.43.2 MatrixGenerics_1.17.1
## [9] matrixStats_1.4.1 BiocGenerics_0.53.0
## [11] Matrix_1.7-1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.3 cli_3.6.3 knitr_1.48
## [4] rlang_1.1.4 xfun_0.48 jsonlite_1.8.9
## [7] dqrng_0.4.1 buildtools_1.0.0 htmltools_0.5.8.1
## [10] maketools_1.3.1 sys_3.4.3 sass_0.4.9
## [13] rmarkdown_2.28 grid_4.4.1 evaluate_1.0.1
## [16] jquerylib_0.1.4 fastmap_1.2.0 yaml_2.3.10
## [19] lifecycle_1.0.4 BiocManager_1.30.25 compiler_4.4.1
## [22] Rcpp_1.0.13 XVector_0.45.0 lattice_0.22-6
## [25] digest_0.6.37 R6_2.5.1 bslib_0.8.0
## [28] tools_4.4.1 zlibbioc_1.51.2 cachem_1.1.0