DFplyr
DFplyr is an R package available from the Bioconductor repository and can be installed with BiocManager::install("DFplyr").
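A typical installation might look like the following sketch. It assumes BiocManager itself may not yet be installed and that a Bioconductor release providing DFplyr is in use.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Install DFplyr from Bioconductor
BiocManager::install("DFplyr")

# Attach the package so the dplyr-style verbs are available for DataFrames
library(DFplyr)
```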
DFplyr is inspired by dplyr, which implements a wide variety of common data manipulations (mutate, select, filter) but only operates on objects of class data.frame or tibble (from the CRAN package tibble).
When working with S4Vectors DataFrames, which are frequently used as components of, for example, SummarizedExperiment objects, a common workaround is to convert the DataFrame to a tibble in order to use dplyr functions to manipulate the contents, before converting back to a DataFrame.
This has several drawbacks: tibble does not support row names (and dplyr frequently does not preserve them), does not support S4 columns (e.g. IRanges vectors), and requires the back-and-forth transformation any time manipulation is desired.
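To make the cost of that workaround concrete, here is a minimal sketch of the round trip on a small DataFrame with plain numeric columns. The objects d0, tb and d1 and the helper column name car are made up for illustration; S4 columns are deliberately left out because they would not survive the conversion intact.

```r
library(S4Vectors)

# A small DataFrame with row names (illustrative data only)
d0 <- DataFrame(
    cyl = c(6, 4, 8),
    hp = c(110, 93, 175),
    row.names = c("Mazda RX4", "Datsun 710", "Hornet Sportabout")
)

# DataFrame -> tibble: row names must be moved into a column to survive
tb <- tibble::as_tibble(as.data.frame(d0), rownames = "car")

# Manipulate with dplyr as usual
tb <- dplyr::mutate(tb, newvar = cyl + hp)

# tibble -> DataFrame: drop the helper column and restore row names by hand
d1 <- as(as.data.frame(tb[, -1]), "DataFrame")
rownames(d1) <- tb$car
d1
```

Every manipulation step needs the same conversion on the way in and on the way out, which is the overhead described above.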
DFplyr
To begin with, we create an S4Vectors DataFrame, including some S4 columns:
library(DFplyr)
library(S4Vectors)
m <- mtcars[, c("cyl", "hp", "am", "gear", "disp")]
d <- as(m, "DataFrame")
d$grX <- GenomicRanges::GRanges("chrX", IRanges::IRanges(1:32, width = 10))
d$grY <- GenomicRanges::GRanges("chrY", IRanges::IRanges(1:32, width = 10))
d$nl <- IRanges::NumericList(lapply(d$gear, function(n) round(rnorm(n), 2)))
d
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,...
This will appear in RStudio’s environment pane as a Formal class DataFrame (dplyr-compatible) when using DFplyr. No interference with the actual object is required, but this helps identify that dplyr-compatibility is available.
DataFrames can then be used in dplyr-like calls the same as data.frame or tibble objects. Support for working with S4 columns is enabled provided they have appropriate functions. Adding multiple columns will result in the new columns being created in alphabetical order. For example, adding a new column newvar which is the sum of the cyl and hp columns:
mutate(d, newvar = cyl + hp)
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl newvar
#> <GRanges> <CompressedNumericList> <numeric>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,... 116
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,... 116
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,... 97
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19 116
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48 183
#> ... ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,... 117
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,... 272
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,... 181
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,... 343
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,... 113
or doubling the nl column as nl2:
mutate(d, nl2 = nl * 2)
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl nl2
#> <GRanges> <CompressedNumericList> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,... 1.16, 0.20,-1.48,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,... -0.52, 0.78,-2.72,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,... 2.80, 3.08,-3.14,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19 3.22,-1.10,-2.38
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48 -1.58, 3.38, 0.96
#> ... ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,... -0.48,-0.88,-2.40,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,... -0.26, 0.56, 1.60,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,... 0.66,2.32,1.16,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,... 0.34,-2.10, 0.26,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,... -0.90,-1.34, 1.40,...
or calculating the lengths() of the nl column cells as length_nl:
mutate(d, length_nl = lengths(nl))
#> DataFrame with 32 rows and 9 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl length_nl
#> <GRanges> <CompressedNumericList> <integer>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,... 4
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,... 4
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,... 4
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19 3
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48 3
#> ... ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,... 5
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,... 5
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,... 5
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,... 5
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,... 4
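The call above uses lengths() rather than length() because the per-row sizes are wanted. A base R sketch of the difference, using a plain list x as a stand-in for the NumericList column (the values are illustrative):

```r
# A plain list standing in for two rows of a list-like column
x <- list(
    a = c(0.58, 0.10, -0.74, 1.20),
    b = c(1.61, -0.55, -1.19)
)

length(x)   # number of list elements: 2
lengths(x)  # size of each element: 4 and 3
```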
Transformations can involve S4-related functions, such as extracting the seqnames(), strand(), and end() of the grX column:
mutate(d,
    chr = GenomeInfoDb::seqnames(grX),
    strand_X = BiocGenerics::strand(grX),
    end_X = BiocGenerics::end(grX)
)
#> DataFrame with 32 rows and 11 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl chr end_X strand_X
#> <GRanges> <CompressedNumericList> <Rle> <integer> <Rle>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,... chrX 10 *
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,... chrX 11 *
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,... chrX 12 *
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19 chrX 13 *
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48 chrX 14 *
#> ... ... ... ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,... chrX 37 *
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,... chrX 38 *
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,... chrX 39 *
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,... chrX 40 *
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,... chrX 41 *
The object returned remains a standard DataFrame, and further calls can be piped with %>%, in this case extracting the newly created newvar column:
mutate(d, newvar = cyl + hp) %>%
    pull(newvar)
#> [1] 116 116 97 116 183 111 253 66 99 129 129 188 188 188 213 223 238 70 56
#> [20] 69 101 158 158 253 183 70 95 117 272 181 343 113
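The alphabetical ordering of newly added columns noted earlier can be checked on a small example. This is a sketch only: d0 and the column names z_sum and a_ratio are invented for illustration, and the new names are deliberately supplied out of alphabetical order so that the resulting column order can be inspected.

```r
library(S4Vectors)
library(DFplyr)

# Small illustrative DataFrame
d0 <- DataFrame(cyl = c(6, 4, 8), hp = c(110, 93, 175))

# Supply the new columns in reverse alphabetical order
out <- mutate(d0, z_sum = cyl + hp, a_ratio = hp / cyl)

# Inspect where the new columns ended up
names(out)
```

Because of this reordering, it is safer not to rely on one new column being available to another new column defined in the same call.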
Some of the variants of the dplyr verbs also work, such as transforming the numeric columns using a formula-style lambda function, in this case squaring them:
mutate_if(d, is.numeric, ~ .^2)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 36 12100 1 16 25600 chrX:1-10
#> Mazda RX4 Wag 36 12100 1 16 25600 chrX:2-11
#> Datsun 710 16 8649 1 16 11664 chrX:3-12
#> Hornet 4 Drive 36 12100 0 9 66564 chrX:4-13
#> Hornet Sportabout 64 30625 0 9 129600 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 16 12769 1 25 9044.01 chrX:28-37
#> Ford Pantera L 64 69696 1 25 123201.00 chrX:29-38
#> Ferrari Dino 36 30625 1 25 21025.00 chrX:30-39
#> Maserati Bora 64 112225 1 25 90601.00 chrX:31-40
#> Volvo 142E 16 11881 1 16 14641.00 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,...
or extracting the start of all of the "GRanges" columns:
mutate_if(d, ~ isa(., "GRanges"), BiocGenerics::start)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <integer>
#> Mazda RX4 6 110 1 4 160 1
#> Mazda RX4 Wag 6 110 1 4 160 2
#> Datsun 710 4 93 1 4 108 3
#> Hornet 4 Drive 6 110 0 3 258 4
#> Hornet Sportabout 8 175 0 3 360 5
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 28
#> Ford Pantera L 8 264 1 5 351.0 29
#> Ferrari Dino 6 175 1 5 145.0 30
#> Maserati Bora 8 335 1 5 301.0 31
#> Volvo 142E 4 109 1 4 121.0 32
#> grY nl
#> <integer> <CompressedNumericList>
#> Mazda RX4 1 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag 2 -0.26, 0.39,-1.36,...
#> Datsun 710 3 1.40, 1.54,-1.57,...
#> Hornet 4 Drive 4 1.61,-0.55,-1.19
#> Hornet Sportabout 5 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa 28 -0.24,-0.44,-1.20,...
#> Ford Pantera L 29 -0.13, 0.28, 0.80,...
#> Ferrari Dino 30 0.33,1.16,0.58,...
#> Maserati Bora 31 0.17,-1.05, 0.13,...
#> Volvo 142E 32 -0.45,-0.67, 0.70,...
Use of tidyselect helpers is limited to within vars() calls and using the _at variants:
mutate_at(d, vars(starts_with("c")), ~ .^2)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 36 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 36 110 1 4 160 chrX:2-11
#> Datsun 710 16 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 36 110 0 3 258 chrX:4-13
#> Hornet Sportabout 64 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 16 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 64 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 36 175 1 5 145.0 chrX:30-39
#> Maserati Bora 64 335 1 5 301.0 chrX:31-40
#> Volvo 142E 16 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,...
and also works with other verbs
select_at(d, vars(starts_with("gr")))
#> DataFrame with 32 rows and 2 columns
#> grX grY
#> <GRanges> <GRanges>
#> Mazda RX4 chrX:1-10 chrY:1-10
#> Mazda RX4 Wag chrX:2-11 chrY:2-11
#> Datsun 710 chrX:3-12 chrY:3-12
#> Hornet 4 Drive chrX:4-13 chrY:4-13
#> Hornet Sportabout chrX:5-14 chrY:5-14
#> ... ... ...
#> Lotus Europa chrX:28-37 chrY:28-37
#> Ford Pantera L chrX:29-38 chrY:29-38
#> Ferrari Dino chrX:30-39 chrY:30-39
#> Maserati Bora chrX:31-40 chrY:31-40
#> Volvo 142E chrX:32-41 chrY:32-41
Importantly, grouped operations are supported. DataFrame does not natively support groups (the same way that data.frame does not), so these are implemented specifically for DFplyr, with group information shown at the top of the printed output:
group_by(d, cyl, am)
#> DataFrame with 32 rows and 8 columns
#> Groups: cyl, am
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,...
Other verbs are similarly implemented, and preserve row names where possible. For example, selecting a limited set of columns using non-standard evaluation (NSE)
select(d, am, cyl)
#> DataFrame with 32 rows and 2 columns
#> am cyl
#> <numeric> <numeric>
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> ... ... ...
#> Lotus Europa 1 4
#> Ford Pantera L 1 8
#> Ferrari Dino 1 6
#> Maserati Bora 1 8
#> Volvo 142E 1 4
Arranging rows according to the ordering of a column
arrange(d, desc(hp))
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Maserati Bora 8 335 1 5 301 chrX:31-40
#> Ford Pantera L 8 264 1 5 351 chrX:29-38
#> Duster 360 8 245 0 3 360 chrX:7-16
#> Camaro Z28 8 245 0 3 350 chrX:24-33
#> Chrysler Imperial 8 230 0 3 440 chrX:17-26
#> ... ... ... ... ... ... ...
#> Fiat 128 4 66 1 4 78.7 chrX:18-27
#> Fiat X1-9 4 66 1 4 79.0 chrX:26-35
#> Toyota Corolla 4 65 1 4 71.1 chrX:20-29
#> Merc 240D 4 62 0 4 146.7 chrX:8-17
#> Honda Civic 4 52 1 4 75.7 chrX:19-28
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Duster 360 chrY:7-16 0.24,-0.51, 1.14
#> Camaro Z28 chrY:24-33 0.35,-0.03, 0.60
#> Chrysler Imperial chrY:17-26 -0.75, 0.09, 0.03
#> ... ... ...
#> Fiat 128 chrY:18-27 0.59, 0.36,-1.04,...
#> Fiat X1-9 chrY:26-35 -1.38, 0.18, 0.61,...
#> Toyota Corolla chrY:20-29 0.74,-0.68,-0.45,...
#> Merc 240D chrY:8-17 -0.58, 0.06,-1.30,...
#> Honda Civic chrY:19-28 0.74,0.68,0.40,...
Filtering to only specific values appearing in a column
filter(d, am == 0)
#> DataFrame with 19 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Hornet 4 Drive 6 110 0 3 258.0 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360.0 chrX:5-14
#> Valiant 6 105 0 3 225.0 chrX:6-15
#> Duster 360 8 245 0 3 360.0 chrX:7-16
#> Merc 240D 4 62 0 4 146.7 chrX:8-17
#> ... ... ... ... ... ... ...
#> Toyota Corona 4 97 0 3 120.1 chrX:21-30
#> Dodge Challenger 8 150 0 3 318.0 chrX:22-31
#> AMC Javelin 8 150 0 3 304.0 chrX:23-32
#> Camaro Z28 8 245 0 3 350.0 chrX:24-33
#> Pontiac Firebird 8 175 0 3 400.0 chrX:25-34
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> Valiant chrY:6-15 -2.18,-0.78,-0.33
#> Duster 360 chrY:7-16 0.24,-0.51, 1.14
#> Merc 240D chrY:8-17 -0.58, 0.06,-1.30,...
#> ... ... ...
#> Toyota Corona chrY:21-30 1.51, 0.73,-0.52
#> Dodge Challenger chrY:22-31 1.47,-0.17,-0.38
#> AMC Javelin chrY:23-32 -0.82,-0.16, 0.43
#> Camaro Z28 chrY:24-33 0.35,-0.03, 0.60
#> Pontiac Firebird chrY:25-34 0.78,0.19,2.32
Selecting specific rows by index
slice(d, 3:6)
#> DataFrame with 4 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> Valiant 6 105 0 3 225 chrX:6-15
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> Valiant chrY:6-15 -2.18,-0.78,-0.33
These also work for grouped objects, and also preserve the row names, e.g. selecting the first two rows from each group of gear:
group_by(d, gear) %>%
    slice(1:2)
#> DataFrame with 6 rows and 8 columns
#> Groups: gear
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Hornet Sportabout 8 175 0 3 360.0 chrX:5-14
#> Merc 450SL 8 180 0 3 275.8 chrX:13-22
#> Mazda RX4 6 110 1 4 160.0 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160.0 chrX:2-11
#> Porsche 914-2 4 91 1 5 120.3 chrX:27-36
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> Merc 450SL chrY:13-22 0.48, 1.45,-0.25
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Porsche 914-2 chrY:27-36 0.17, 0.24,-1.72,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
rename is itself renamed to rename2 due to conflicts between dplyr and S4Vectors, but works in the dplyr sense of taking new = old replacements with NSE syntax:
select(d, am, cyl) %>%
    rename2(foo = am)
#> DataFrame with 32 rows and 2 columns
#> foo cyl
#> <numeric> <numeric>
#> Mazda RX4 1 6
#> Mazda RX4 Wag 1 6
#> Datsun 710 1 4
#> Hornet 4 Drive 0 6
#> Hornet Sportabout 0 8
#> ... ... ...
#> Lotus Europa 1 4
#> Ford Pantera L 1 8
#> Ferrari Dino 1 6
#> Maserati Bora 1 8
#> Volvo 142E 1 4
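The naming conflict is easier to see side by side. The sketch below assumes that S4Vectors::rename() takes its mapping as old = "new" pairs with the new name given as a string (as described in the S4Vectors documentation), which is the reverse of the dplyr convention used by rename2(). The object d0 is invented for illustration.

```r
library(S4Vectors)
library(DFplyr)

d0 <- DataFrame(am = c(1, 0), cyl = c(6, 8))

# S4Vectors convention (assumed): old name on the left, new name as a string
a <- S4Vectors::rename(d0, am = "foo")

# DFplyr / dplyr convention: new name on the left, old name unquoted
b <- rename2(d0, foo = am)

names(a)
names(b)
```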
Row names are not preserved when there may be duplicates or when they do not make sense; otherwise the first label is kept (according to the current de-duplication method; in the case of distinct, this is via BiocGenerics::duplicated). This may have complications for S4 columns.
distinct(d)
#> DataFrame with 32 rows and 8 columns
#> cyl hp am gear disp grX
#> <numeric> <numeric> <numeric> <numeric> <numeric> <GRanges>
#> Mazda RX4 6 110 1 4 160 chrX:1-10
#> Mazda RX4 Wag 6 110 1 4 160 chrX:2-11
#> Datsun 710 4 93 1 4 108 chrX:3-12
#> Hornet 4 Drive 6 110 0 3 258 chrX:4-13
#> Hornet Sportabout 8 175 0 3 360 chrX:5-14
#> ... ... ... ... ... ... ...
#> Lotus Europa 4 113 1 5 95.1 chrX:28-37
#> Ford Pantera L 8 264 1 5 351.0 chrX:29-38
#> Ferrari Dino 6 175 1 5 145.0 chrX:30-39
#> Maserati Bora 8 335 1 5 301.0 chrX:31-40
#> Volvo 142E 4 109 1 4 121.0 chrX:32-41
#> grY nl
#> <GRanges> <CompressedNumericList>
#> Mazda RX4 chrY:1-10 0.58, 0.10,-0.74,...
#> Mazda RX4 Wag chrY:2-11 -0.26, 0.39,-1.36,...
#> Datsun 710 chrY:3-12 1.40, 1.54,-1.57,...
#> Hornet 4 Drive chrY:4-13 1.61,-0.55,-1.19
#> Hornet Sportabout chrY:5-14 -0.79, 1.69, 0.48
#> ... ... ...
#> Lotus Europa chrY:28-37 -0.24,-0.44,-1.20,...
#> Ford Pantera L chrY:29-38 -0.13, 0.28, 0.80,...
#> Ferrari Dino chrY:30-39 0.33,1.16,0.58,...
#> Maserati Bora chrY:31-40 0.17,-1.05, 0.13,...
#> Volvo 142E chrY:32-41 -0.45,-0.67, 0.70,...
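Since every row of d is unique, the call above returns all 32 rows. A smaller sketch with an actual duplicate shows the de-duplication itself. The object d0 and its row names are invented for illustration; which row name is retained follows the rule described above, so the printed result is worth inspecting rather than assuming.

```r
library(S4Vectors)
library(DFplyr)

# Rows r1 and r2 hold identical values
d0 <- DataFrame(
    x = c(1, 1, 2),
    y = c(10, 10, 20),
    row.names = c("r1", "r2", "r3")
)

# Drop the duplicated row
u <- distinct(d0)
u
```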
Behaviours are ideally the same as those of dplyr wherever possible, for example a grouped tally
group_by(d, cyl, am) %>%
    tally(gear)
#> DataFrame with 6 rows and 3 columns
#> cyl am n
#> <numeric> <numeric> <numeric>
#> 1 4 0 11
#> 2 4 1 34
#> 3 6 0 14
#> 4 6 1 13
#> 5 8 0 36
#> 6 8 1 10
or a count of the rows for each combination of several columns
count(d, gear, am, cyl)
#> DataFrame with 10 rows and 4 columns
#> gear am cyl n
#> <factor> <Rle> <Rle> <integer>
#> 1 3 0 4 1
#> 2 3 0 6 2
#> 3 3 0 8 12
#> 4 4 0 4 2
#> 5 4 0 6 2
#> 6 4 1 4 6
#> 7 4 1 6 2
#> 8 5 1 4 2
#> 9 5 1 6 1
#> 10 5 1 8 2
DFplyr
We hope that DFplyr will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
citation("DFplyr")
#> Warning in citation("DFplyr"): could not determine year for 'DFplyr' from
#> package DESCRIPTION file
#> To cite package 'DFplyr' in publications use:
#>
#> Carroll J (????). _DFplyr: A `DataFrame` (`S4Vectors`) backend for
#> `dplyr`_. R package version 1.1.0,
#> <https://github.com/jonocarroll/DFplyr>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {DFplyr: A `DataFrame` (`S4Vectors`) backend for `dplyr`},
#> author = {Jonathan Carroll},
#> note = {R package version 1.1.0},
#> url = {https://github.com/jonocarroll/DFplyr},
#> }
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2024-12-18
#> pandoc 3.2.1 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> BiocGenerics * 0.53.3 2024-12-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#> BiocManager 1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#> BiocStyle * 2.35.0 2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> buildtools 1.0.0 2024-12-16 [3] local (/pkg)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> DFplyr * 1.1.0 2024-12-18 [1] https://bioc.r-universe.dev (R 4.4.2)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> dplyr * 1.1.4 2023-11-17 [2] RSPM (R 4.4.0)
#> evaluate 1.0.1 2024-10-10 [2] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> generics * 0.1.3 2022-07-05 [2] RSPM (R 4.4.0)
#> GenomeInfoDb 1.43.2 2024-11-28 [2] https://bioc.r-universe.dev (R 4.4.2)
#> GenomeInfoDbData 1.2.13 2024-12-18 [2] Bioconductor
#> GenomicRanges 1.59.1 2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#> glue 1.8.0 2024-09-30 [2] RSPM (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] RSPM (R 4.4.0)
#> IRanges 2.41.2 2024-12-03 [2] https://bioc.r-universe.dev (R 4.4.2)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] RSPM (R 4.4.0)
#> knitr 1.49 2024-11-08 [2] RSPM (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [2] RSPM (R 4.4.0)
#> maketools 1.3.1 2024-10-04 [3] RSPM (R 4.4.0)
#> pillar 1.10.0 2024-12-17 [2] CRAN (R 4.4.2)
#> pkgconfig 2.0.3 2019-09-22 [2] RSPM (R 4.4.0)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.29 2024-11-04 [2] RSPM (R 4.4.0)
#> S4Vectors * 0.45.2 2024-12-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> sys 3.4.3 2024-10-04 [2] RSPM (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [2] RSPM (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [2] RSPM (R 4.4.0)
#> UCSC.utils 1.3.0 2024-11-30 [2] https://bioc.r-universe.dev (R 4.4.2)
#> vctrs 0.6.5 2023-12-01 [2] RSPM (R 4.4.0)
#> withr 3.0.2 2024-10-28 [2] RSPM (R 4.4.0)
#> xfun 0.49 2024-10-31 [2] RSPM (R 4.4.0)
#> XVector 0.47.0 2024-11-21 [2] https://bioc.r-universe.dev (R 4.4.2)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> zlibbioc 1.52.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#>
#> [1] /tmp/RtmpencNT5/Rinstefa302864e0
#> [2] /github/workspace/pkglib
#> [3] /usr/local/lib/R/site-library
#> [4] /usr/lib/R/site-library
#> [5] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────