Title: | Representation of SQL tables in DataFrame metaphor |
---|---|
Description: | Implements bindings for SQL tables that are compatible with Bioconductor S4 data structures, namely the DataFrame and DelayedArray. This allows SQL-derived data to be easily used inside other Bioconductor objects (e.g., SummarizedExperiments) while keeping everything on disk. |
Authors: | Qian Liu [aut, cre] , Aaron Lun [aut], Martin Morgan [aut] |
Maintainer: | Qian Liu <[email protected]> |
License: | LGPL (>= 3); File LICENSE |
Version: | 1.21.0 |
Built: | 2024-11-19 04:26:18 UTC |
Source: | https://github.com/bioc/SQLDataFrame |
Acquire a (possibly cached) SQL file connection given it's path.
acquireConn(path, dbtype = NULL) releaseConn(path)
acquireConn(path, dbtype = NULL) releaseConn(path)
path |
String containing a path to a SQL file. |
dbtype |
String containing the SQL database type (case insensitive). Supported types are "SQLite" and "DuckDB". |
acquireConn
will cache the DBIConnection object in
the current R session to avoid repeated initialization. This
improves efficiency for repeated calls, e.g., when creating a
DataFrame with multiple columns from the same SQL
table. The cached DBIConnection for any given path
can
be deleted by calling releaseConn
for the same
path
.
For acquireConn
, a DBIConnection with backends of
SQLite or DuckDB, which are identical to that returned by
DBI::dbConnect(RSQLite::SQLite(), path)
or
DBI::dbConnect(duckdb::duckdb(), path)
.
For releaseConn
, any existing DBIConnection for the
path
is disconnected and cleared from cache, and NULL
is invisibly returned. If path=NULL
, all cached connections
are removed.
Qian Liu
########### ## SQLite ########### ## Mocking up a file tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Acquire or release connection con <- acquireConn(tf, dbtype = "SQLite") acquireConn(tf, dbtype = "SQLite") # just re-uses the cache releaseConn(tf) # clears the cache ########### ## DuckDB ########### tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) con <- acquireConn(tf1, dbtype = "DuckDB") releaseConn(tf1)
########### ## SQLite ########### ## Mocking up a file tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Acquire or release connection con <- acquireConn(tf, dbtype = "SQLite") acquireConn(tf, dbtype = "SQLite") # just re-uses the cache releaseConn(tf) # clears the cache ########### ## DuckDB ########### tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) con <- acquireConn(tf1, dbtype = "DuckDB") releaseConn(tf1)
Represent a column of a SQL table as a 1-dimensional DelayedArray. This allows us to use SQL data inside DataFrames without loading them into memory.
SQLColumnSeed(path, dbtype, table, column, length = NULL, type = NULL) SQLColumnVector(x, ...)
SQLColumnSeed(path, dbtype, table, column, length = NULL, type = NULL) SQLColumnVector(x, ...)
path |
String containing a path to a SQL file. |
dbtype |
String containing the SQL database type (case insensitive). Supported types are "SQLite" and "DuckDB". |
table |
String containing the name of the table in SQL file. |
column |
String containing the name of the column inside the table. |
length |
Integer containing the number of rows. If
|
type |
String specifying the type of the data. If |
x |
A SQLColumnSeed object. |
... |
Further arguments to be passed to the
|
For SQLColumnSeed
: a SQLColumnSeed. For
SQLColumnVector
: a SQLColumnVector.
#'
Qian Liu
# Mocking up a file: tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) # Creating a vector: SQLColumnVector(tf, dbtype = "SQLite", "mtcars", column="gear") # This happily lives inside DataFrames: collected <- list() for (x in colnames(mtcars)) { collected[[x]] <- SQLColumnVector(tf, dbtype = "SQLite", "mtcars", column=x) } DataFrame(collected) #'
# Mocking up a file: tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) # Creating a vector: SQLColumnVector(tf, dbtype = "SQLite", "mtcars", column="gear") # This happily lives inside DataFrames: collected <- list() for (x in colnames(mtcars)) { collected[[x]] <- SQLColumnVector(tf, dbtype = "SQLite", "mtcars", column=x) } DataFrame(collected) #'
Create a SQL-backed DataFrame, where the data are
kept on disk until requested. Direct extension classes are
SQLiteDataFrame
and DuckDBDataFrame
.
SQLDataFrame(path, dbtype = NULL, table = NULL, columns = NULL, nrows = NULL)
SQLDataFrame(path, dbtype = NULL, table = NULL, columns = NULL, nrows = NULL)
path |
String containing a path to a SQL file. |
dbtype |
String containing the SQL database type (case insensitive). Supported types are "SQLite" and "DuckDB". |
table |
String containing the name of SQL table. |
columns |
Character vector containing the names of columns in
a SQL table. If |
nrows |
Integer scalar specifying the number of rows in a SQL
table. If |
The SQLDataFrame is essentially just a DataFrame of SQLColumnVector objects. It is primarily useful for indicating that the in-memory representation is consistent with the underlying SQL file (e.g., no delayed filter/mutate operations have been applied, no data has been added from other files). Thus, users can specialize code paths for a SQLDataFrame to operate directly on the underlying SQL table.
In that vein, operations on a SQLDataFrame may return another SQLDataFrame if the operation does not introduce inconsistencies with the file-backed data. For example, slicing or combining by column will return a SQLDataFrame as the contents of the retained columns are unchanged. In other cases, the SQLDataFrame will collapse to a regular DFrame of SQLColumnVector objects before applying the operation; these are still file-backed but lack the guarantee of file consistency.
A SQLDataFrame where each column is a SQLColumnVector.
Qian Liu
## Mocking up a file: ### SQLite tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ### DuckDB tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Creating a SQLite-backed data frame: df <- SQLDataFrame(tf, dbtype = "SQLite", table = "mtcars") df1 <- SQLiteDataFrame(tf, "mtcars") identical(df, df1) ## DuckDB-backed data frame: df2 <- SQLDataFrame(tf1, dbtype = "duckdb", table = "mtcars") df3 <- DuckDBDataFrame(tf1, "mtcars") identical(df2, df3) ## Extraction yields a SQLiteColumnVector: df$carb ## Some operations preserve the SQLDataFrame: df[,1:5] combined <- cbind(df, df) class(combined) ## ... but most operations collapse to a regular DFrame: df[1:5,] combined2 <- cbind(df, some_new_name=df[,1]) class(combined2) df1 <- df rownames(df1) <- paste0("row", seq_len(nrow(df1))) class(df1) df2 <- df colnames(df2) <- letters[1:ncol(df2)] class(df2) df3 <- df df3$carb <- mtcars$carb class(df3) ## Utility functions path(df) dbtype(df) sqltable(df) dim(df) names(df) as.data.frame(df)
## Mocking up a file: ### SQLite tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ### DuckDB tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Creating a SQLite-backed data frame: df <- SQLDataFrame(tf, dbtype = "SQLite", table = "mtcars") df1 <- SQLiteDataFrame(tf, "mtcars") identical(df, df1) ## DuckDB-backed data frame: df2 <- SQLDataFrame(tf1, dbtype = "duckdb", table = "mtcars") df3 <- DuckDBDataFrame(tf1, "mtcars") identical(df2, df3) ## Extraction yields a SQLiteColumnVector: df$carb ## Some operations preserve the SQLDataFrame: df[,1:5] combined <- cbind(df, df) class(combined) ## ... but most operations collapse to a regular DFrame: df[1:5,] combined2 <- cbind(df, some_new_name=df[,1]) class(combined2) df1 <- df rownames(df1) <- paste0("row", seq_len(nrow(df1))) class(df1) df2 <- df colnames(df2) <- letters[1:ncol(df2)] class(df2) df3 <- df df3$carb <- mtcars$carb class(df3) ## Utility functions path(df) dbtype(df) sqltable(df) dim(df) names(df) as.data.frame(df)
Extensions of SQLDataFrame, SQLColumnVector, SQLColumnSeed with different SQL backends. Currently supporting SQLite and DuckDB, with which the definition coding can be followed for added extension of other SQL backends.
path |
String containing a path to a SQL file. |
table |
String containing the name of the table in SQL file. |
column |
String containing the name of the column inside the table. |
length |
Integer containing the number of rows. If
|
type |
String specifying the type of the data. If |
## Mocking up a file: ### SQLite tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ### DuckDB tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Constructor of xxColumnSeed and xxColumnVector sd <- SQLiteColumnSeed(tf, "mtcars", "gear") scv <- SQLiteColumnVector(sd) scv1 <- SQLiteColumnVector(tf, "mtcars", "gear") identical(scv, scv1) DuckDBColumnSeed(tf1, "mtcars", "mpg") DuckDBColumnVector(tf1, "mtcars", "mpg") ## Constructor of xxDataFrame SQLiteDataFrame(tf, "mtcars") DuckDBDataFrame(tf1, "mtcars")
## Mocking up a file: ### SQLite tf <- tempfile() on.exit(unlink(tf)) con <- DBI::dbConnect(RSQLite::SQLite(), tf) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ### DuckDB tf1 <- tempfile() on.exit(unlist(tf1)) con <- DBI::dbConnect(duckdb::duckdb(), tf1) DBI::dbWriteTable(con, "mtcars", mtcars) DBI::dbDisconnect(con) ## Constructor of xxColumnSeed and xxColumnVector sd <- SQLiteColumnSeed(tf, "mtcars", "gear") scv <- SQLiteColumnVector(sd) scv1 <- SQLiteColumnVector(tf, "mtcars", "gear") identical(scv, scv1) DuckDBColumnSeed(tf1, "mtcars", "mpg") DuckDBColumnVector(tf1, "mtcars", "mpg") ## Constructor of xxDataFrame SQLiteDataFrame(tf, "mtcars") DuckDBDataFrame(tf1, "mtcars")