Firstly, I would like to extend my special thanks to Aaron Lun for
his foundational work on the ParquetDataFrame
package, and
for his highly technical and detailed suggestions aimed at enhancing the
current implementation of the SQLDataFrame
package. Here I’m introducing the new version of
SQLDataFrame
for handling different SQL-backed files.
Essentially, the implementation of SQLDataFrame
is modeled
upon ParquetDataFrame
regarding its data structure,
construction, documentation, and examples. This approach ensures the
retension of best practices and maintains consistentcy in the use within
Bioconductor ecosystem, thus simplifying the learning curve for
users.
The SQLDataFrame
, as its name suggests, is a
DataFrame
where the columns are derived from data in a SQL
table. This is fully file-backed so no data is actually loaded into
memory until requested, allowing users to represent large datasets in
limited memory. As the SQLDataFrame
inherits from S4Vectors’
DataFrame
, it can be used anywhere in Bioconductor’s
ecosystem that accepts a DataFrame
, e.g., as the column
data of a SummarizedExperiment
, or inside a
BumpyDataFrameMatrix
from the BumpyMatrix
package.
SQLDataFrame
currently supports the following backends
with their respective extension classes (and constructor functions):
SQLite
: SQLiteDataFrame
DuckDB
: DuckDBDataFrame
It can be easily extended to any other SQL-backed file types by
simply defining the extension classs in SQL_extensions.R
with minor updates in acquireConn.R
to create a database
instance. Pull requests for adding new SQL backends are welcome!
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SQLDataFrame")
Given a path, database type and table name to a SQL file, we can
construct a SQLDataFrame
. If the backend is supported, we
can construct an extension class directly.
## Mocking up a file.
tf <- tempfile()
con <- DBI::dbConnect(RSQLite::SQLite(), tf)
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbDisconnect(con)
## Creating the SQLiteDataFrame.
library(SQLDataFrame)
df <- SQLDataFrame(tf, dbtype = "sqlite", table = "mtcars")
df0 <- SQLiteDataFrame(tf, table = "mtcars")
identical(df, df0)
## [1] TRUE
Similarly, we can create a DuckDbDataFrame
:
tf1 <- tempfile()
on.exit(unlist(tf1))
con <- DBI::dbConnect(duckdb::duckdb(), tf1)
DBI::dbWriteTable(con, "mtcars", mtcars)
DBI::dbDisconnect(con)
df1 <- SQLDataFrame(tf1, dbtype = "duckdb", table = "mtcars")
df2 <- DuckDBDataFrame(tf1, table = "mtcars")
identical(df1, df2)
## [1] TRUE
These support all the usual methods for a DataFrame
,
except that the data is kept on file and referenced as needed:
## [1] 32
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
## [1] "data.frame"
We extract individual columns as SQLColumnVector
objects. These are 1-dimensional file-backed DelayedArray
s
that pull a single column’s data from the SQL table on demand.
## <32> SQLColumnVector object of type "double":
## [1] [2] [3] . [31] [32]
## 21.0 21.0 22.8 . 15.0 21.4
## <32> DelayedArray object of type "double":
## [1] [2] [3] . [31] [32]
## 210 210 228 . 150 214
## <32> DelayedArray object of type "double":
## [1] [2] [3] . [31] [32]
## 3.091042 3.091042 3.169686 . 2.772589 3.109061
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
DFrame
The main goal of a SQLDataFrame
is to serve as a
consistent representation of the data inside a SQL table. However, this
cannot be easily reconciled with many DataFrame
operations
that add or change data - at least, not without mutating the SQL file,
which is outside the scope of the SQLDataFrame
class. To
handle such operations, the SQLDataFrame
will collapse to a
DFrame
of SQLColumnVector
s:
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"
## [1] "mpg" "cyl" "disp"
## [4] "hp" "drat" "wt"
## [7] "qsec" "vs" "am"
## [10] "gear" "carb" "some_random_thing"
This preserves the memory efficiency of file-backed data while
supporting all DataFrame
operations. For example, we can
easily subset and mutate the various columns, which manifest as delayed
operations inside each column.
## DataFrame with 6 rows and 12 columns
## mpg cyl disp hp drat
## <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray>
## 1 21 6 160 110 3.9
## 2 21 6 160 110 3.9
## 3 22.8 4 108 93 3.85
## 4 21.4 6 258 110 3.08
## 5 18.7 8 360 175 3.15
## 6 18.1 6 225 105 2.76
## wt qsec vs am gear
## <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray> <DelayedArray>
## 1 2620 16.46 0 1 4
## 2 2875 17.02 0 1 4
## 3 2320 18.61 1 1 4
## 4 3215 19.44 1 0 3
## 5 3440 17.02 0 0 3
## 6 3460 20.22 1 0 3
## carb some_random_thing
## <DelayedArray> <numeric>
## 1 4 0.271621
## 2 4 0.519242
## 3 1 0.803451
## 4 1 0.607353
## 5 2 0.977282
## 6 1 0.288257
The fallback to DFrame
ensures that a
SQLDataFrame
is interoperable with other Bioconductor data
structures that need to perform arbitrary DataFrame
operations. Of course, when a collapse occurs, we lose all guarantees
that the in-memory representation is compatible with the underlying SQL
table. This may preclude further optimizations in cases where we
consider directly operating on the file.
At any point, users can retrieve a handle of connection to the
underlying SQL file via the acquireConn()
function. This
can be used with methods in the DBI, RSQLite, and
duckdb
packages. The cached DBIConnection
for any given
path
can be deleted by calling
releaseConn
.
## <SQLiteConnection>
## Path: /tmp/Rtmpwvvcsh/fileee581d9bd5
## Extensions: TRUE
Note that the acquired handle will not capture any delayed
subsetting/mutation operations that have been applied in the R session.
In theory, it is possible to convert a subset of
r Biocpkg("DelayedArray")
operations into their DBI equivalents,
which would improve performance by avoiding the R interpreter when
executing a query on the file.
In practice, any performance boost tends to be rather fragile as only a subset of operations are supported, meaning that it is easy to silently fall back to R-based evaluation when an unsupported operation is executed. Users wanting to optimize query performance should just operate on the handle directly.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SQLDataFrame_1.21.0 DelayedArray_0.33.2 SparseArray_1.7.2
## [4] S4Arrays_1.7.1 abind_1.4-8 IRanges_2.41.1
## [7] S4Vectors_0.45.2 MatrixGenerics_1.19.0 matrixStats_1.4.1
## [10] BiocGenerics_0.53.3 generics_0.1.3 Matrix_1.7-1
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bit_4.5.0 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 crayon_1.5.3 blob_1.2.4
## [7] jquerylib_0.1.4 yaml_2.3.10 fastmap_1.2.0
## [10] lattice_0.22-6 R6_2.5.1 XVector_0.47.0
## [13] knitr_1.49 maketools_1.3.1 DBI_1.2.3
## [16] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
## [19] xfun_0.49 sass_0.4.9 sys_3.4.3
## [22] bit64_4.5.2 memoise_2.0.1 RSQLite_2.3.8
## [25] cli_3.6.3 zlibbioc_1.52.0 digest_0.6.37
## [28] grid_4.4.2 lifecycle_1.0.4 vctrs_0.6.5
## [31] evaluate_1.0.1 duckdb_1.1.2 buildtools_1.0.0
## [34] rmarkdown_2.29 pkgconfig_2.0.3 tools_4.4.2
## [37] htmltools_0.5.8.1