Example Usage

Basics

Install DFplyr

DFplyr is an R package available from the Bioconductor repository. It can be installed with BiocManager::install():

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("DFplyr")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Background

DFplyr is inspired by dplyr, which implements a wide variety of common data manipulations (mutate, select, filter) but only operates on objects of class data.frame or tibble (from the tibble package).

When working with S4Vectors DataFrames, which are frequently used as components of, for example, SummarizedExperiment objects, a common workaround is to convert the DataFrame to a tibble, use dplyr functions to manipulate the contents, and then convert back to a DataFrame.

This has several drawbacks: tibble does not support rownames (and dplyr frequently does not preserve them), tibble does not support S4 columns (e.g. IRanges vectors), and the back-and-forth conversion is required every time manipulation is desired.
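The round trip described above can be sketched as follows. This is a minimal illustration rather than DFplyr code: the object names d0, tb and d1 are placeholders, and it assumes the S4Vectors, tibble and dplyr packages are installed.

```r
library(S4Vectors)

# A small DataFrame whose row names are the car models
d0 <- as(mtcars[, c("cyl", "hp")], "DataFrame")

# Workaround: DataFrame -> tibble -> dplyr verb -> DataFrame
tb <- tibble::as_tibble(as.data.frame(d0)) # row names are dropped here
tb <- dplyr::mutate(tb, newvar = cyl + hp)
d1 <- as(as.data.frame(tb), "DataFrame")

head(rownames(d0), 3) # car model names
rownames(d1)          # no longer the car model names
```

The computed column survives the round trip, but the row labels do not, and every further manipulation would need the same two conversions.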

Quick start to using DFplyr

library("DFplyr")

To begin with, we create an S4Vectors DataFrame, including some S4 columns

library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
#> Warning: multiple methods tables found for 'union'
#> Warning: multiple methods tables found for 'intersect'
#> Warning: multiple methods tables found for 'setdiff'
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> ...                      ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...

This will appear in RStudio’s environment pane as a

Formal class DataFrame (dplyr-compatible)

when using DFplyr. No interference with the actual object is required, but this helps identify that dplyr-compatibility is available.

DataFrames can then be used in dplyr-like calls the same as data.frame or tibble objects. Support for working with S4 columns is enabled provided they have appropriate functions. Adding multiple columns will result in the new columns being created in alphabetical order. For example, adding a new column newvar which is the sum of the cyl and hp columns

mutate(d, newvar = cyl + hp)
#> DataFrame with 32 rows and 9 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl    newvar
#>                    <GRanges> <CompressedNumericList> <numeric>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...       116
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...       116
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...        97
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64       116
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27       183
#> ...                      ...                     ...       ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...       117
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...       272
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...       181
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...       343
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...       113

or doubling the nl column as nl2

mutate(d, nl2 = nl * 2)
#> DataFrame with 32 rows and 9 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl                     nl2
#>                    <GRanges> <CompressedNumericList> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...    0.62,-0.38,-0.94,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...   -1.04,-1.24,-0.38,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...   -1.60, 2.76,-1.74,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64       -2.10,-0.22,-3.28
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27       -0.04, 1.54, 0.54
#> ...                      ...                     ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...    3.08, 1.56,-1.52,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...    0.12,-0.80,-1.42,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...    3.16, 3.26,-0.02,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...   -1.30,-1.44, 0.76,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...   -1.70, 0.00, 0.52,...

or calculating the lengths() of the nl column cells as length_nl

mutate(d, length_nl = lengths(nl))
#> DataFrame with 32 rows and 9 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl length_nl
#>                    <GRanges> <CompressedNumericList> <integer>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...         4
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...         4
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...         4
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64         3
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27         3
#> ...                      ...                     ...       ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...         5
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...         5
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...         5
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...         5
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...         4
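The two nl examples above rely on operations that the list class itself provides, which is what "appropriate functions" refers to. A small sketch outside of any DataFrame may make this clearer; it assumes the IRanges package is installed, and the toy object nl_toy is purely illustrative.

```r
library(IRanges)

# A two-element NumericList with elements of different lengths
nl_toy <- NumericList(a = c(1, 2), b = 3)

doubled <- nl_toy * 2   # arithmetic is applied within each list element
sizes <- lengths(nl_toy) # one length per list element

unlist(doubled, use.names = FALSE)
sizes
```

Because these operations are vectorised over list elements, they return one result per row when used on a list column inside mutate().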

Transformations can involve S4-related functions, such as extracting the seqnames(), strand(), and end() of the grX column

mutate(d,
    chr = GenomeInfoDb::seqnames(grX),
    strand_X = BiocGenerics::strand(grX),
    end_X = BiocGenerics::end(grX)
)
#> DataFrame with 32 rows and 11 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl   chr     end_X strand_X
#>                    <GRanges> <CompressedNumericList> <Rle> <integer>    <Rle>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...  chrX        10        *
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...  chrX        11        *
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...  chrX        12        *
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64  chrX        13        *
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27  chrX        14        *
#> ...                      ...                     ...   ...       ...      ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...  chrX        37        *
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...  chrX        38        *
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...  chrX        39        *
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...  chrX        40        *
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...  chrX        41        *

The object returned remains a standard DataFrame, and further calls can be piped with %>%, in this case extracting the newly created newvar column

mutate(d, newvar = cyl + hp) %>%
    pull(newvar)
#>  [1] 116 116  97 116 183 111 253  66  99 129 129 188 188 188 213 223 238  70  56
#> [20]  69 101 158 158 253 183  70  95 117 272 181 343 113
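The earlier note that multiple new columns are created in alphabetical order can be checked directly. The sketch below assumes DFplyr and S4Vectors are installed; d_small, zz and aa are illustrative names chosen so that the requested order differs from the alphabetical one.

```r
library(S4Vectors)
library(DFplyr)

d_small <- as(mtcars[, c("cyl", "hp")], "DataFrame")

# zz is requested first, aa second
res <- mutate(d_small, zz = cyl * 2, aa = hp / 2)

# Inspect where the two new columns ended up
colnames(res)
```

If column order matters downstream, a following select() call can put the columns in the required order.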

Some of the variants of the dplyr verbs also work, such as transforming the numeric columns using a quosure-style lambda function, in this case squaring them

mutate_if(d, is.numeric, ~ .^2)
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                36     12100         1        16     25600  chrX:1-10
#> Mazda RX4 Wag            36     12100         1        16     25600  chrX:2-11
#> Datsun 710               16      8649         1        16     11664  chrX:3-12
#> Hornet 4 Drive           36     12100         0         9     66564  chrX:4-13
#> Hornet Sportabout        64     30625         0         9    129600  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa             16     12769         1        25   9044.01 chrX:28-37
#> Ford Pantera L           64     69696         1        25 123201.00 chrX:29-38
#> Ferrari Dino             36     30625         1        25  21025.00 chrX:30-39
#> Maserati Bora            64    112225         1        25  90601.00 chrX:31-40
#> Volvo 142E               16     11881         1        16  14641.00 chrX:32-41
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> ...                      ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...

or extracting the start of all of the "GRanges" columns

mutate_if(d, ~ isa(., "GRanges"), BiocGenerics::start)
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp       grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric> <integer>
#> Mazda RX4                 6       110         1         4       160         1
#> Mazda RX4 Wag             6       110         1         4       160         2
#> Datsun 710                4        93         1         4       108         3
#> Hornet 4 Drive            6       110         0         3       258         4
#> Hornet Sportabout         8       175         0         3       360         5
#> ...                     ...       ...       ...       ...       ...       ...
#> Lotus Europa              4       113         1         5      95.1        28
#> Ford Pantera L            8       264         1         5     351.0        29
#> Ferrari Dino              6       175         1         5     145.0        30
#> Maserati Bora             8       335         1         5     301.0        31
#> Volvo 142E                4       109         1         4     121.0        32
#>                         grY                      nl
#>                   <integer> <CompressedNumericList>
#> Mazda RX4                 1    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag             2   -0.52,-0.62,-0.19,...
#> Datsun 710                3   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive            4       -1.05,-0.11,-1.64
#> Hornet Sportabout         5       -0.02, 0.77, 0.27
#> ...                     ...                     ...
#> Lotus Europa             28    1.54, 0.78,-0.76,...
#> Ford Pantera L           29    0.06,-0.40,-0.71,...
#> Ferrari Dino             30    1.58, 1.63,-0.01,...
#> Maserati Bora            31   -0.65,-0.72, 0.38,...
#> Volvo 142E               32   -0.85, 0.00, 0.26,...

Use of tidyselect helpers is limited to within vars() calls and using the _at variants

mutate_at(d, vars(starts_with("c")), ~ .^2)
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                36       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag            36       110         1         4       160  chrX:2-11
#> Datsun 710               16        93         1         4       108  chrX:3-12
#> Hornet 4 Drive           36       110         0         3       258  chrX:4-13
#> Hornet Sportabout        64       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa             16       113         1         5      95.1 chrX:28-37
#> Ford Pantera L           64       264         1         5     351.0 chrX:29-38
#> Ferrari Dino             36       175         1         5     145.0 chrX:30-39
#> Maserati Bora            64       335         1         5     301.0 chrX:31-40
#> Volvo 142E               16       109         1         4     121.0 chrX:32-41
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> ...                      ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...

and also works with other verbs

select_at(d, vars(starts_with("gr")))
#> DataFrame with 32 rows and 2 columns
#>                          grX        grY
#>                    <GRanges>  <GRanges>
#> Mazda RX4          chrX:1-10  chrY:1-10
#> Mazda RX4 Wag      chrX:2-11  chrY:2-11
#> Datsun 710         chrX:3-12  chrY:3-12
#> Hornet 4 Drive     chrX:4-13  chrY:4-13
#> Hornet Sportabout  chrX:5-14  chrY:5-14
#> ...                      ...        ...
#> Lotus Europa      chrX:28-37 chrY:28-37
#> Ford Pantera L    chrX:29-38 chrY:29-38
#> Ferrari Dino      chrX:30-39 chrY:30-39
#> Maserati Bora     chrX:31-40 chrY:31-40
#> Volvo 142E        chrX:32-41 chrY:32-41

Importantly, grouped operations are supported. DataFrame does not natively support groups (the same way that data.frame does not) so these are implemented specifically for DFplyr with group information shown at the top of the printed output

group_by(d, cyl, am)
#> DataFrame with 32 rows and 8 columns
#> Groups:  cyl, am 
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> ...                      ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...

Other verbs are similarly implemented, and preserve row names where possible. For example, selecting a limited set of columns using non-standard evaluation (NSE)

select(d, am, cyl)
#> DataFrame with 32 rows and 2 columns
#>                          am       cyl
#>                   <numeric> <numeric>
#> Mazda RX4                 1         6
#> Mazda RX4 Wag             1         6
#> Datsun 710                1         4
#> Hornet 4 Drive            0         6
#> Hornet Sportabout         0         8
#> ...                     ...       ...
#> Lotus Europa              1         4
#> Ford Pantera L            1         8
#> Ferrari Dino              1         6
#> Maserati Bora             1         8
#> Volvo 142E                1         4

Arranging rows according to the ordering of a column

arrange(d, desc(hp))
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Maserati Bora             8       335         1         5       301 chrX:31-40
#> Ford Pantera L            8       264         1         5       351 chrX:29-38
#> Duster 360                8       245         0         3       360  chrX:7-16
#> Camaro Z28                8       245         0         3       350 chrX:24-33
#> Chrysler Imperial         8       230         0         3       440 chrX:17-26
#> ...                     ...       ...       ...       ...       ...        ...
#> Fiat 128                  4        66         1         4      78.7 chrX:18-27
#> Fiat X1-9                 4        66         1         4      79.0 chrX:26-35
#> Toyota Corolla            4        65         1         4      71.1 chrX:20-29
#> Merc 240D                 4        62         0         4     146.7  chrX:8-17
#> Honda Civic               4        52         1         4      75.7 chrX:19-28
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Duster 360         chrY:7-16        0.96, 1.15,-0.27
#> Camaro Z28        chrY:24-33       -1.35,-0.45, 0.96
#> Chrysler Imperial chrY:17-26          2.22,0.55,0.62
#> ...                      ...                     ...
#> Fiat 128          chrY:18-27    1.27,-0.29, 0.48,...
#> Fiat X1-9         chrY:26-35   -0.12,-0.59, 0.28,...
#> Toyota Corolla    chrY:20-29      1.57,1.01,0.42,...
#> Merc 240D          chrY:8-17    0.78,-0.87,-0.60,...
#> Honda Civic       chrY:19-28    0.21,-1.06, 0.98,...

Filtering to only specific values appearing in a column

filter(d, am == 0)
#> DataFrame with 19 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Hornet 4 Drive            6       110         0         3     258.0  chrX:4-13
#> Hornet Sportabout         8       175         0         3     360.0  chrX:5-14
#> Valiant                   6       105         0         3     225.0  chrX:6-15
#> Duster 360                8       245         0         3     360.0  chrX:7-16
#> Merc 240D                 4        62         0         4     146.7  chrX:8-17
#> ...                     ...       ...       ...       ...       ...        ...
#> Toyota Corona             4        97         0         3     120.1 chrX:21-30
#> Dodge Challenger          8       150         0         3     318.0 chrX:22-31
#> AMC Javelin               8       150         0         3     304.0 chrX:23-32
#> Camaro Z28                8       245         0         3     350.0 chrX:24-33
#> Pontiac Firebird          8       175         0         3     400.0 chrX:25-34
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> Valiant            chrY:6-15       -0.17, 0.40, 0.20
#> Duster 360         chrY:7-16        0.96, 1.15,-0.27
#> Merc 240D          chrY:8-17    0.78,-0.87,-0.60,...
#> ...                      ...                     ...
#> Toyota Corona     chrY:21-30        0.01,-1.29, 0.37
#> Dodge Challenger  chrY:22-31        0.78,-1.00, 1.57
#> AMC Javelin       chrY:23-32       -1.21, 0.59, 1.28
#> Camaro Z28        chrY:24-33       -1.35,-0.45, 0.96
#> Pontiac Firebird  chrY:25-34        0.97,-1.70, 0.83

Selecting specific rows by index

slice(d, 3:6)
#> DataFrame with 4 rows and 8 columns
#>                         cyl        hp        am      gear      disp       grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Datsun 710                4        93         1         4       108 chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258 chrX:4-13
#> Hornet Sportabout         8       175         0         3       360 chrX:5-14
#> Valiant                   6       105         0         3       225 chrX:6-15
#>                         grY                      nl
#>                   <GRanges> <CompressedNumericList>
#> Datsun 710        chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive    chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout chrY:5-14       -0.02, 0.77, 0.27
#> Valiant           chrY:6-15       -0.17, 0.40, 0.20

These also work for grouped objects, and also preserve the rownames, e.g. selecting the first two rows from each group of gear

group_by(d, gear) %>%
    slice(1:2)
#> DataFrame with 6 rows and 8 columns
#> Groups:  gear 
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Hornet Sportabout         8       175         0         3     360.0  chrX:5-14
#> Merc 450SL                8       180         0         3     275.8 chrX:13-22
#> Mazda RX4                 6       110         1         4     160.0  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4     160.0  chrX:2-11
#> Porsche 914-2             4        91         1         5     120.3 chrX:27-36
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> Merc 450SL        chrY:13-22        0.40,-0.39, 1.69
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Porsche 914-2     chrY:27-36   -2.59,-1.34, 1.53,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...

rename is itself renamed to rename2 due to conflicts between dplyr and S4Vectors, but works in the dplyr sense of taking new = old replacements with NSE syntax

select(d, am, cyl) %>%
    rename2(foo = am)
#> DataFrame with 32 rows and 2 columns
#>                         foo       cyl
#>                   <numeric> <numeric>
#> Mazda RX4                 1         6
#> Mazda RX4 Wag             1         6
#> Datsun 710                1         4
#> Hornet 4 Drive            0         6
#> Hornet Sportabout         0         8
#> ...                     ...       ...
#> Lotus Europa              1         4
#> Ford Pantera L            1         8
#> Ferrari Dino              1         6
#> Maserati Bora             1         8
#> Volvo 142E                1         4

Row names are not preserved when there may be duplicates or when they don’t make sense; otherwise the first label is kept, according to the current de-duplication method (in the case of distinct, this is BiocGenerics::duplicated). This may cause complications for S4 columns.

distinct(d)
#> DataFrame with 32 rows and 8 columns
#>                         cyl        hp        am      gear      disp        grX
#>                   <numeric> <numeric> <numeric> <numeric> <numeric>  <GRanges>
#> Mazda RX4                 6       110         1         4       160  chrX:1-10
#> Mazda RX4 Wag             6       110         1         4       160  chrX:2-11
#> Datsun 710                4        93         1         4       108  chrX:3-12
#> Hornet 4 Drive            6       110         0         3       258  chrX:4-13
#> Hornet Sportabout         8       175         0         3       360  chrX:5-14
#> ...                     ...       ...       ...       ...       ...        ...
#> Lotus Europa              4       113         1         5      95.1 chrX:28-37
#> Ford Pantera L            8       264         1         5     351.0 chrX:29-38
#> Ferrari Dino              6       175         1         5     145.0 chrX:30-39
#> Maserati Bora             8       335         1         5     301.0 chrX:31-40
#> Volvo 142E                4       109         1         4     121.0 chrX:32-41
#>                          grY                      nl
#>                    <GRanges> <CompressedNumericList>
#> Mazda RX4          chrY:1-10    0.31,-0.19,-0.47,...
#> Mazda RX4 Wag      chrY:2-11   -0.52,-0.62,-0.19,...
#> Datsun 710         chrY:3-12   -0.80, 1.38,-0.87,...
#> Hornet 4 Drive     chrY:4-13       -1.05,-0.11,-1.64
#> Hornet Sportabout  chrY:5-14       -0.02, 0.77, 0.27
#> ...                      ...                     ...
#> Lotus Europa      chrY:28-37    1.54, 0.78,-0.76,...
#> Ford Pantera L    chrY:29-38    0.06,-0.40,-0.71,...
#> Ferrari Dino      chrY:30-39    1.58, 1.63,-0.01,...
#> Maserati Bora     chrY:31-40   -0.65,-0.72, 0.38,...
#> Volvo 142E        chrY:32-41   -0.85, 0.00, 0.26,...
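In the call above every row is unique, so nothing is removed. A sketch with real duplicates may be more informative; it assumes DFplyr and S4Vectors are installed, and d_small is an illustrative object holding only two plain columns.

```r
library(S4Vectors)
library(DFplyr)

d_small <- as(mtcars[, c("cyl", "am")], "DataFrame")

# Only 6 (cyl, am) combinations occur in mtcars, so most rows are duplicates
u <- distinct(d_small)
nrow(u)
rownames(u) # inspect which labels, if any, were kept
```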

Behaviours are ideally the same as those of dplyr wherever possible, for example a grouped tally

group_by(d, cyl, am) %>%
    tally(gear)
#> DataFrame with 6 rows and 3 columns
#>         cyl        am         n
#>   <numeric> <numeric> <numeric>
#> 1         4         0        11
#> 2         4         1        34
#> 3         6         0        14
#> 4         6         1        13
#> 5         8         0        36
#> 6         8         1        10

or a count of the rows in each combination of several columns:

count(d, gear, am, cyl)
#> DataFrame with 10 rows and 4 columns
#>        gear    am   cyl         n
#>    <factor> <Rle> <Rle> <integer>
#> 1         3     0     4         1
#> 2         3     0     6         2
#> 3         3     0     8        12
#> 4         4     0     4         2
#> 5         4     0     6         2
#> 6         4     1     4         6
#> 7         4     1     6         2
#> 8         5     1     4         2
#> 9         5     1     6         1
#> 10        5     1     8         2
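
To make the comparison with dplyr concrete, the sketch below runs the same two calls on the plain mtcars data.frame using dplyr alone, so the results can be set beside the DataFrame output above. It assumes only that dplyr is installed. The object names dplyr_tally and dplyr_count are illustrative and are not part of DFplyr.

```r
library(dplyr)

# Grouped tally on a plain data.frame: the first argument after the data
# is the weight, so tally(gear) sums `gear` within each cyl/am group.
dplyr_tally <- mtcars %>%
    group_by(cyl, am) %>%
    tally(gear)

# Unweighted count: the number of rows in each gear/am/cyl combination.
dplyr_count <- count(mtcars, gear, am, cyl)

dplyr_tally
dplyr_count
```

The n values should agree with those printed above. The column classes differ, because dplyr returns ordinary data.frame or tibble columns rather than the factor and Rle columns shown in the DataFrame result.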

Citing DFplyr

We hope that DFplyr will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

citation("DFplyr")
#> Warning in citation("DFplyr"): could not determine year for 'DFplyr' from
#> package DESCRIPTION file
#> To cite package 'DFplyr' in publications use:
#> 
#>   Carroll J (????). _DFplyr: A `DataFrame` (`S4Vectors`) backend for
#>   `dplyr`_. R package version 1.1.0,
#>   <https://github.com/jonocarroll/DFplyr>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {DFplyr: A `DataFrame` (`S4Vectors`) backend for `dplyr`},
#>     author = {Jonathan Carroll},
#>     note = {R package version 1.1.0},
#>     url = {https://github.com/jonocarroll/DFplyr},
#>   }

Session Information

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       Ubuntu 24.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2024-11-18
#>  pandoc   3.2.1 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package          * version date (UTC) lib source
#>  BiocGenerics     * 0.53.3  2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  BiocManager        1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#>  BiocStyle        * 2.35.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  bslib              0.8.0   2024-07-29 [2] RSPM (R 4.4.0)
#>  buildtools         1.0.0   2024-11-11 [3] local (/pkg)
#>  cachem             1.1.0   2024-05-16 [2] RSPM (R 4.4.0)
#>  cli                3.6.3   2024-06-21 [2] RSPM (R 4.4.0)
#>  DFplyr           * 1.1.0   2024-11-18 [1] https://bioc.r-universe.dev (R 4.4.2)
#>  digest             0.6.37  2024-08-19 [2] RSPM (R 4.4.0)
#>  dplyr            * 1.1.4   2023-11-17 [2] RSPM (R 4.4.0)
#>  evaluate           1.0.1   2024-10-10 [2] RSPM (R 4.4.0)
#>  fansi              1.0.6   2023-12-08 [2] RSPM (R 4.4.0)
#>  fastmap            1.2.0   2024-05-15 [2] RSPM (R 4.4.0)
#>  generics         * 0.1.3   2022-07-05 [2] RSPM (R 4.4.0)
#>  GenomeInfoDb       1.43.1  2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  GenomeInfoDbData   1.2.13  2024-11-18 [2] Bioconductor
#>  GenomicRanges      1.59.0  2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  glue               1.8.0   2024-09-30 [2] RSPM (R 4.4.0)
#>  htmltools          0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#>  httr               1.4.7   2023-08-15 [2] RSPM (R 4.4.0)
#>  IRanges            2.41.1  2024-11-17 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  jquerylib          0.1.4   2021-04-26 [2] RSPM (R 4.4.0)
#>  jsonlite           1.8.9   2024-09-20 [2] RSPM (R 4.4.0)
#>  knitr              1.49    2024-11-08 [2] RSPM (R 4.4.0)
#>  lifecycle          1.0.4   2023-11-07 [2] RSPM (R 4.4.0)
#>  magrittr           2.0.3   2022-03-30 [2] RSPM (R 4.4.0)
#>  maketools          1.3.1   2024-10-04 [3] RSPM (R 4.4.0)
#>  pillar             1.9.0   2023-03-22 [2] RSPM (R 4.4.0)
#>  pkgconfig          2.0.3   2019-09-22 [2] RSPM (R 4.4.0)
#>  R6                 2.5.1   2021-08-19 [2] RSPM (R 4.4.0)
#>  rlang              1.1.4   2024-06-04 [2] RSPM (R 4.4.0)
#>  rmarkdown          2.29    2024-11-04 [2] RSPM (R 4.4.0)
#>  S4Vectors        * 0.45.2  2024-11-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  sass               0.4.9   2024-03-15 [2] RSPM (R 4.4.0)
#>  sessioninfo        1.2.2   2021-12-06 [2] RSPM (R 4.4.0)
#>  sys                3.4.3   2024-10-04 [2] RSPM (R 4.4.0)
#>  tibble             3.2.1   2023-03-20 [2] RSPM (R 4.4.0)
#>  tidyselect         1.2.1   2024-03-11 [2] RSPM (R 4.4.0)
#>  UCSC.utils         1.3.0   2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  utf8               1.2.4   2023-10-22 [2] RSPM (R 4.4.0)
#>  vctrs              0.6.5   2023-12-01 [2] RSPM (R 4.4.0)
#>  withr              3.0.2   2024-10-28 [2] RSPM (R 4.4.0)
#>  xfun               0.49    2024-10-31 [2] RSPM (R 4.4.0)
#>  XVector            0.47.0  2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  yaml               2.3.10  2024-07-26 [2] RSPM (R 4.4.0)
#>  zlibbioc           1.52.0  2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#> 
#>  [1] /tmp/Rtmps2ErbX/Rinste86196856ce
#>  [2] /github/workspace/pkglib
#>  [3] /usr/local/lib/R/site-library
#>  [4] /usr/lib/R/site-library
#>  [5] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────