--- title: Using SQL-backed DataFrames author: - name: Qian Liu email: Qian.Liu@RoswellPark.org affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Aaron Lun affiliation: Genentech, Inc. - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY package: SQLDataFrame date: "Revised: February 5, 2024" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{User guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo=FALSE} library(BiocStyle) self <- Biocpkg("SQLDataFrame"); knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE) ``` # Overview Firstly, I would like to extend my special thanks to Aaron Lun for his foundational work on the `ParquetDataFrame` package, and for his highly technical and detailed suggestions aimed at enhancing the current implementation of the [`SQLDataFrame` package](https://github.com/Bioconductor/SQLDataFrame/issues/8). Here I'm introducing the new version of `SQLDataFrame` for handling different SQL-backed files. Essentially, the implementation of `SQLDataFrame` is modeled upon `ParquetDataFrame` regarding its data structure, construction, documentation, and examples. This approach ensures the retension of best practices and maintains consistentcy in the use within Bioconductor ecosystem, thus simplifying the learning curve for users. The `SQLDataFrame`, as its name suggests, is a `DataFrame` where the columns are derived from data in a SQL table. This is fully file-backed so no data is actually loaded into memory until requested, allowing users to represent large datasets in limited memory. As the `SQLDataFrame` inherits from `r Biocpkg("S4Vectors")`' `DataFrame`, it can be used anywhere in Bioconductor's ecosystem that accepts a `DataFrame`, e.g., as the column data of a `SummarizedExperiment`, or inside a `BumpyDataFrameMatrix` from the `r Biocpkg("BumpyMatrix")` package. `SQLDataFrame` currently supports the following backends with their respective extension classes (and constructor functions): - `SQLite`: `SQLiteDataFrame` - `DuckDB`: `DuckDBDataFrame` It can be easily extended to any other SQL-backed file types by simply defining the extension classs in `SQL_extensions.R` with minor updates in `acquireConn.R` to create a database instance. Pull requests for adding new SQL backends are welcome! # Package installation 1. Download the package from Bioconductor. ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SQLDataFrame") ``` 2. Load the package into R session. ```{r Load, message=FALSE} library(SQLDataFrame) ``` # Basic usage Given a path, database type and table name to a SQL file, we can construct a `SQLDataFrame`. If the backend is supported, we can construct an extension class directly. ```{r} ## Mocking up a file. tf <- tempfile() con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Creating the SQLiteDataFrame. library(SQLDataFrame) df <- SQLDataFrame(tf, dbtype = "sqlite", table = "mtcars") df0 <- SQLiteDataFrame(tf, table = "mtcars") identical(df, df0) ``` Similarly, we can create a `DuckDbDataFrame`: ```{r} tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) df1 <- SQLDataFrame(tf1, dbtype = "duckdb", table = "mtcars") df2 <- DuckDBDataFrame(tf1, table = "mtcars") identical(df1, df2) ``` These support all the usual methods for a `DataFrame`, except that the data is kept on file and referenced as needed: ```{r} nrow(df) colnames(df) class(as.data.frame(df)) ``` We extract individual columns as `SQLColumnVector` objects. These are 1-dimensional file-backed `DelayedArray`s that pull a single column's data from the SQL table on demand. ```{r} df$mpg # These can participate in usual vector operations: df$mpg * 10 log1p(df$mpg) # Realize this into an ordinary vector. as.vector(df$mpg) ``` # Collapsing to a `DFrame` The main goal of a `SQLDataFrame` is to serve as a consistent representation of the data inside a SQL table. However, this cannot be easily reconciled with many `DataFrame` operations that add or change data - at least, not without mutating the SQL file, which is outside the scope of the `SQLDataFrame` class. To handle such operations, the `SQLDataFrame` will collapse to a `DFrame` of `SQLColumnVector`s: ```{r} copy <- df copy$some_random_thing <- runif(nrow(df)) class(copy) colnames(copy) ``` This preserves the memory efficiency of file-backed data while supporting all `DataFrame` operations. For example, we can easily subset and mutate the various columns, which manifest as delayed operations inside each column. ```{r} copy$wt <- copy$wt * 1000 top.hits <- head(copy) top.hits ``` The fallback to `DFrame` ensures that a `SQLDataFrame` is interoperable with other Bioconductor data structures that need to perform arbitrary `DataFrame` operations. Of course, when a collapse occurs, we lose all guarantees that the in-memory representation is compatible with the underlying SQL table. This may preclude further optimizations in cases where we consider directly operating on the file. # Retrieving the SQL connection At any point, users can retrieve a handle of connection to the underlying SQL file via the `acquireConn()` function. This can be used with methods in the `r CRANpkg("DBI")`, `r CRANpkg("RSQLite")`, and `r CRANpkg("duckdb")` packages. The cached `DBIConnection` for any given `path` can be deleted by calling `releaseConn`. ```{r} handle <- acquireConn(path(df), dbtype = dbtype(df)) handle releaseConn(path(df)) ``` Note that the acquired handle will not capture any delayed subsetting/mutation operations that have been applied in the R session. In theory, it is possible to convert a subset of `r Biocpkg("DelayedArray")` operations into their `r CRANpkg("DBI")` equivalents, which would improve performance by avoiding the R interpreter when executing a query on the file. In practice, any performance boost tends to be rather fragile as only a subset of operations are supported, meaning that it is easy to silently fall back to R-based evaluation when an unsupported operation is executed. Users wanting to optimize query performance should just operate on the handle directly. # Session information {-} ```{r} sessionInfo() ```