As the genetic/genomic data are having increasingly larger profile,
the annotation file are also getting much bigger than expected. the
memory space in R has been an obstable for fast and efficient
data processing, because most available R or
Bioconductor packages are developed based on in-memory data
manipulation. With some newly developed data structure as HDF5 or GDS, and the R
interface of DelayedArray
to represent on-disk data structures with different back-end in
R-user-friendly array data structure (e.g., HDF5Array,GDSArray), the
high-throughput genetic/genomic data are now being able to easily loaded
and manipulated within R. However, the annotation files for the
samples and features inside the high-through data are also getting
unexpectedly larger than before. With an ordinary
data.frame
or DataFrame
, it is still getting
more and more challenging for any analysis to be done within R.
So here we have developed the DelayedDataFrame
, which has
the very similar characteristics as data.frame
and
DataFrame
. But at the same time, all column data could be
optionally saved on-disk (e.g., in DelayedArray
structure with any back-end). Common operations like constructing,
subsetting, splitting, combining could be done in the same way as
DataFrame
. This feature of DelayedDataFrame
could enable efficient on-disk reading and processing of the large-scale
annotation files, and at the same, signicantly saves memory space with
common DataFrame
metaphor in R and
Bioconductor.
Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DelayedDataFrame")
The development version is also available to download through github:
Load the package into R session before using:
DelayedDataFrame
extends the DataFrame
data
structure, with an additional slot called lazyIndex
, which
saves all the mapping indexes for each column of the data inside
DelayedDataFrame
. It is similar to data.frame
in terms of construction, subsetting, splitting, combining… The
rownames
are having same feature as DataFrame
.
It will not be given automatically, but only by explicitly specify in
the constructor function DelayedDataFrame(, row.names=...)
or using the slot setter function rownames()<-
.
Here we use the GDSArray data as
example to show the DelayedDataFrame
characteristics. GDSArray is a
Bioconductor package that represents GDS files as objects
derived from the DelayedArray
package and DelayedArray
class. It carries the on-disk data
path and represent the GDS nodes in a DelayedArray
-derived
data structure.
The GDSArray()
constructor takes 2 arguments: the file
path and the GDS node name inside the GDS file.
library(GDSArray)
## Loading required package: gdsfmt
file <- SeqArray::seqExampleFileName("gds")
gdsnodes(file)
## [1] "sample.id" "variant.id"
## [3] "position" "chromosome"
## [5] "allele" "genotype/data"
## [7] "genotype/~data" "genotype/extra.index"
## [9] "genotype/extra" "phase/data"
## [11] "phase/~data" "phase/extra.index"
## [13] "phase/extra" "annotation/id"
## [15] "annotation/qual" "annotation/filter"
## [17] "annotation/info/AA" "annotation/info/AC"
## [19] "annotation/info/AN" "annotation/info/DP"
## [21] "annotation/info/HM2" "annotation/info/HM3"
## [23] "annotation/info/OR" "annotation/info/GP"
## [25] "annotation/info/BN" "annotation/format/DP/data"
## [27] "annotation/format/DP/~data" "sample.annotation/family"
varid <- GDSArray(file, "annotation/id")
DP <- GDSArray(file, "annotation/info/DP")
We use an ordinary character vector and the GDSArray
objects to construct a DelayedDataFrame
object.
The slots of DelayedDataFrame
could be accessed by
lazyIndex()
, nrow()
, rownames()
(if not NULL) functions. With a newly constructed
DelayedDataFrame
object, the initial value of
lazyIndex
slot will be NULL for all columns.
lazyIndex
slotThe lazyIndex
slot is in LazyIndex
class,
which is defined in the DelayedDataFrame
package and
extends the SimpleList
class. The listData
slot saves unique indexes for all the columns, and the
index
slots saves the position of index in
listData
slot for each column in
DelayedDataFrame
object. In the above example, with an
initial construction of DelayedDataFrame
object, the index
for each column will all be NULL, and all 3 columns points the NULL
values which sits in the first position in listData
slot of
lazyIndex
.
Whenever an operation is done (e.g., subsetting), the
listData
slot inside the DelayedDataFrame
stays the same, but the lazyIndex
slot will be updated, so
that the show method, further statistical calculation will be applied to
the subsetting data set. For example, here we subset the
DelayedDataFrame
object ddf
to keep only the
first 5 rows, and see how the lazyIndex
works. As shown in
below, after subsetting, the listData
slot in
ddf1
stays the same as ddf
. But the subsetting
operation was recorded in the lazyIndex
slot, and the slots
of lazyIndex
, nrows
and rownames
(if not NULL) are all updated. So the subsetting operation is kind of
delayed
.
ddf1 <- ddf[1:20,]
identical(ddf@listData, ddf1@listData)
## [1] TRUE
lazyIndex(ddf1)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
nrow(ddf1)
## [1] 20
Only when functions like DataFrame()
, or
as.list()
, the lazyIndex
will be realized and
DelayedDataFrame
returned. We will show the realization in
the following coercion method section.
The common methods on data.frame
or
DataFrame
are also defined on DelayedDataFrame
class, so that they behave similarily on DelayedDataFrame
objects.
Coercion methods between DelayedDataFrame
and other data
structures are defined. When coercing from ANY
to
DelayedDataFrame
, the lazyIndex
slot will be
added automatically, with the initial NULL value of indexes for each
column.
as(letters, "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## X
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
as(DataFrame(letters), "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## letters
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
(a <- as(list(a=1:5, b=6:10), "DelayedDataFrame"))
## DelayedDataFrame with 5 rows and 2 columns
## a b
## <integer> <integer>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
When coerce DelayedDataFrame
into other data structure,
the lazyIndex
slot will be realized and the new data
structure returned. For example, when DelayedDataFrame
is
coerced into a DataFrame
object, the listData
slot will be updated according to the lazyIndex
slot.
df1 <- as(ddf1, "DataFrame")
df1@listData
## $varid
## <20> DelayedArray object of type "character":
## [1] [2] [3] . [19]
## "rs111751804" "rs114390380" "rs1320571" . "rs61751002"
## [20]
## "rs6691840"
##
## $DP
## <20> DelayedArray object of type "integer":
## [1] [2] [3] [4] . [17] [18] [19] [20]
## 3251 2676 7610 3383 . 6040 6589 5089 6871
dim(df1)
## [1] 20 2
[
two-dimensional [
subsetting on
DelayedDataFrame
objects by integer, character, logical
values all work.
ddf[, 1, drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
ddf[, "DP", drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 1346 823
## 1347 1257
## 1348 48
ddf[, c(TRUE,FALSE), drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
When subsetting using [
on an already subsetted
DelayedDataFrame
object, the lazyIndex
,
nrows
and rownames
(if not NULL) slot will be
updated.
(a <- ddf1[1:10, 2, drop=FALSE])
## DelayedDataFrame with 10 rows and 1 column
## DP
## <GDSArray>
## 1 3251
## 2 2676
## 3 7610
## ... ...
## 8 9076
## 9 9605
## 10 9707
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## index of each column:
## [1] 1
nrow(a)
## [1] 10
[[
The [[
subsetting will take column subscripts for
integer or character values, and return corresponding columns in it’s
original data format.
ddf[[1]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
ddf[["varid"]]
## <1348> GDSArray object of type "character":
## [1] [2] [3] . [1347]
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## [1348]
## "rs5771206"
identical(ddf[[1]], ddf[["varid"]])
## [1] TRUE
rbind/cbind
When doing rbind
, the lazyIndex
of input
arguments will be realized and a new DelayedDataFrame
with
NULL lazyIndex will be returned.
ddf2 <- ddf[21:40, ]
(ddfrb <- rbind(ddf1, ddf2))
## DelayedDataFrame with 40 rows and 2 columns
## varid DP
## <DelayedArray> <DelayedArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 38 rs1886116 3641
## 39 rs115917561 3089
## 40 rs61751016 4109
lazyIndex(ddfrb)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
cbind
of DelayedDataFrame
objects will keep
all existing lazyIndex
of input arguments and carry into
the new DelayedDataFrame
object.
(ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], DP=ddf1[, 2, drop=FALSE]))
## DelayedDataFrame with 20 rows and 2 columns
## varid DP.DP
## <GDSArray> <GDSArray>
## 1 rs111751804 3251
## 2 rs114390380 2676
## 3 rs1320571 7610
## ... ... ...
## 18 rs115614983 6589
## 19 rs61751002 5089
## 20 rs6691840 6871
lazyIndex(ddfcb)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GDSArray_1.25.2 gdsfmt_1.41.2 DelayedDataFrame_1.23.0
## [4] DelayedArray_0.31.14 SparseArray_1.5.45 S4Arrays_1.5.11
## [7] abind_1.4-8 IRanges_2.39.2 MatrixGenerics_1.17.1
## [10] matrixStats_1.4.1 Matrix_1.7-1 S4Vectors_0.43.2
## [13] BiocGenerics_0.53.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 compiler_4.4.1 BiocManager_1.30.25
## [4] crayon_1.5.3 Biostrings_2.75.0 GenomicRanges_1.57.2
## [7] parallel_4.4.1 jquerylib_0.1.4 yaml_2.3.10
## [10] fastmap_1.2.0 lattice_0.22-6 R6_2.5.1
## [13] XVector_0.45.0 GenomeInfoDb_1.41.2 knitr_1.48
## [16] maketools_1.3.1 GenomeInfoDbData_1.2.13 bslib_0.8.0
## [19] rlang_1.1.4 cachem_1.1.0 xfun_0.48
## [22] sass_0.4.9 sys_3.4.3 cli_3.6.3
## [25] zlibbioc_1.51.2 digest_0.6.37 grid_4.4.1
## [28] SeqArray_1.45.4 lifecycle_1.0.4 evaluate_1.0.1
## [31] buildtools_1.0.0 httr_1.4.7 rmarkdown_2.28
## [34] UCSC.utils_1.1.0 tools_4.4.1 htmltools_0.5.8.1