Title: | Brings SummarizedExperiment to the Tidyverse |
---|---|
Description: | The tidySummarizedExperiment package provides a set of tools for creating and manipulating tidy data representations of SummarizedExperiment objects. SummarizedExperiment is a widely used data structure in bioinformatics for storing high-throughput genomic data, such as gene expression or DNA sequencing data. The package introduces a tidy framework for working with these objects: it lets users convert their data into a tidy format, where each observation is a row and each variable is a column. This tidy representation simplifies data manipulation and integrates with other tidyverse packages and the broader ecosystem of tidy tools for data analysis. |
Authors: | Stefano Mangiola [aut, cre] |
Maintainer: | Stefano Mangiola <[email protected]> |
License: | GPL-3 |
Version: | 1.17.0 |
Built: | 2024-11-18 04:40:22 UTC |
Source: | https://github.com/bioc/tidySummarizedExperiment |
as_tibble() turns an existing object, such as a data frame or matrix, into a so-called tibble, a data frame with class tbl_df. This is in contrast with tibble(), which builds a tibble from individual columns. as_tibble() is to tibble() as base::as.data.frame() is to base::data.frame().
as_tibble() is an S3 generic, with methods for:
data.frame: Thin wrapper around the list method that implements tibble's treatment of rownames.
Default: Other inputs are first coerced with base::as.data.frame().
as_tibble_row() converts a vector to a tibble with one row. If the input is a list, all elements must have size one.
as_tibble_col() converts a vector to a tibble with one column.
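The difference between converting and constructing can be sketched as follows. This is a minimal illustration on a plain data frame, not on a SummarizedExperiment; the object names and values (df, sample, dose) are made up for the example.

```r
library(tibble)

# Hypothetical toy data standing in for sample metadata.
df <- data.frame(sample = c("s1", "s2"), dose = c(1, 10))

# as_tibble() converts an object that already exists ...
tbl_from_df <- as_tibble(df)

# ... whereas tibble() builds a tibble from individual columns.
tbl_from_cols <- tibble(sample = c("s1", "s2"), dose = c(1, 10))

# as_tibble_row(): a named vector becomes a single row, one column per element.
one_row <- as_tibble_row(c(a = 1, b = 2))

# as_tibble_col(): the whole vector becomes a single column.
one_col <- as_tibble_col(1:3, column_name = "value")
```

Both tbl_from_df and tbl_from_cols carry the tbl_df class; they differ only in how they were produced.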
## S3 method for class 'SummarizedExperiment'
as_tibble(
  x,
  ...,
  .name_repair = c("check_unique", "unique", "universal", "minimal"),
  rownames = pkgconfig::get_config("tibble::rownames", NULL)
)
x |
A data frame, list, matrix, or other object that could reasonably be coerced to a tibble. |
... |
Unused, for extensibility. |
.name_repair |
Treatment of problematic column names:
This argument is passed on as |
rownames |
How to treat existing row names of a data frame or matrix:
Read more in rownames. |
tibble
The default behavior is to silently remove row names. New code should explicitly convert row names to a new column using the rownames argument. For existing code that relies on the retention of row names, call pkgconfig::set_config("tibble::rownames" = NA) in your script or in your package's .onLoad() function.
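A short sketch of the two row-name behaviours described above, using a plain data frame rather than a SummarizedExperiment. The column name "feature" and the gene labels are hypothetical choices for the illustration, and the sketch assumes the tibble::rownames option has not been changed with pkgconfig.

```r
library(tibble)

# Hypothetical count table whose row names carry the feature identifiers.
m <- data.frame(counts = c(5, 0), row.names = c("geneA", "geneB"))

# Default: row names are silently dropped.
dropped <- as_tibble(m)

# Explicit: row names are converted into a new first column.
kept <- as_tibble(m, rownames = "feature")
```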
Using as_tibble() for vectors is superseded as of version 3.0.0, prefer the more expressive as_tibble_row() and as_tibble_col() variants for new code.
tibble() constructs a tibble from individual columns. enframe() converts a named vector to a tibble with a column of names and column of values. Name repair is implemented using vctrs::vec_as_names().
tidySummarizedExperiment::pasilla %>% as_tibble()
tidySummarizedExperiment::pasilla %>% as_tibble(.subset=-c(condition, type))
This is an efficient implementation of the common pattern of 'do.call(rbind, dfs)' or 'do.call(cbind, dfs)' for binding many data frames into one.
## S3 method for class 'SummarizedExperiment'
bind_rows(..., .id = NULL, add.cell.ids = NULL)

## S3 method for class 'SummarizedExperiment'
bind_cols(..., .id = NULL)

## S3 method for class 'RangedSummarizedExperiment'
bind_cols(..., .id = NULL)
... |
Data frames to combine. Each argument can either be a data frame, a list that could be a data frame, or a list of data frames. When row-binding, columns are matched by name, and any missing columns will be filled with NA. When column-binding, rows are matched by position, so all data frames must have the same number of rows. To match by value, not position, see mutate-joins. |
.id |
Data frame identifier. When '.id' is supplied, a new column of identifiers is created to link each row to its original data frame. The labels are taken from the named arguments to 'bind_rows()'. When a list of data frames is supplied, the labels are taken from the names of the list. If no names are found a numeric sequence is used instead. |
add.cell.ids |
Appends the corresponding values to |
The output of 'bind_rows()' will contain a column if that column appears in any of the inputs.
'bind_rows()' and 'bind_cols()' return the same type as the first input, either a data frame, 'tbl_df', or 'grouped_df'.
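The column-matching and '.id' behaviour can be sketched on plain data frames. This illustrates the generic dplyr behaviour only, not the SummarizedExperiment method; the list names a and b and the column names are hypothetical.

```r
library(dplyr)

# Two hypothetical data frames that share only the id column.
dfs <- list(
  a = data.frame(id = 1:2, x = c("u", "v")),
  b = data.frame(id = 3L, y = 0.5)
)

# Columns are matched by name; columns missing from an input are filled
# with NA, and .id records which list element each row came from.
combined <- bind_rows(dfs, .id = "source")
```

With these inputs, do.call(rbind, dfs) in base R would stop with an error because the column names differ, which is one reason to prefer bind_rows() for this pattern.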
data(se)
ttservice::bind_rows(se, se)

se_bind <- se |> select(dex, albut)
se |> ttservice::bind_cols(se_bind)
count() lets you quickly count the unique values of one or more variables: df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n()). count() is paired with tally(), a lower-level helper that is equivalent to df %>% summarise(n = n()). Supply wt to perform weighted counts, switching the summary from n = n() to n = sum(wt).
add_count() and add_tally() are equivalents to count() and tally() but use mutate() instead of summarise() so that they add a new column with group-wise counts.
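The equivalences above can be checked on a small tibble. This is a sketch on made-up data (columns a and w are hypothetical), using the generic dplyr verbs rather than the SummarizedExperiment method.

```r
library(dplyr)

# Hypothetical data: a grouping column and a weight column.
df <- tibble(a = c("x", "x", "y"), w = c(2, 3, 5))

# count() and the explicit group_by() + summarise() give the same counts.
by_count <- df %>% count(a)
by_hand  <- df %>% group_by(a) %>% summarise(n = n())

# Supplying wt switches the summary from n() to sum(w).
weighted <- df %>% count(a, wt = w)

# add_count() keeps every row and adds the group-wise count as a column.
with_col <- df %>% add_count(a)
```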
## S3 method for class 'SummarizedExperiment'
count(
  x,
  ...,
  wt = NULL,
  sort = FALSE,
  name = NULL,
  .drop = group_by_drop_default(x)
)
x |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
... |
< |
wt |
<
|
sort |
If |
name |
The name of the new column in the output. If omitted, it will default to |
.drop |
Handling of factor levels that don't appear in the data, passed
on to For For |
An object of the same type as .data. count() and add_count() group transiently, so the output has the same groups as the input.
data(se)
se |> count(dex)
Keep only unique/distinct rows from a data frame. This is similar to unique.data.frame() but considerably faster.
## S3 method for class 'SummarizedExperiment'
distinct(.data, ..., .keep_all = FALSE)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< |
.keep_all |
If |
An object of the same type as .data. The output has the following properties:
Rows are a subset of the input but appear in the same order.
Columns are not modified if ... is empty or .keep_all is TRUE. Otherwise, distinct() first calls mutate() to create new columns.
Groups are not modified.
Data frame attributes are preserved.
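A small sketch of how the .keep_all argument changes the output, run on a plain tibble with hypothetical sample, gene, and counts columns (not on a SummarizedExperiment).

```r
library(dplyr)

# Hypothetical long-format table: one row per sample/gene pair.
df <- tibble(
  sample = c("s1", "s1", "s2"),
  gene   = c("g1", "g2", "g1"),
  counts = c(3, 7, 3)
)

# Only the selected column is returned, one row per distinct value.
samples_only <- df %>% distinct(sample)

# With .keep_all = TRUE the first row for each distinct value is kept,
# together with all other columns.
first_rows <- df %>% distinct(sample, .keep_all = TRUE)
```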
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
data(pasilla)
pasilla |> distinct(.sample)
extract() has been superseded in favour of separate_wider_regex() because it has a more polished API and better handling of problems. Superseded functions will not go away, but will only receive critical bug fixes.
Given a regular expression with capturing groups, extract() turns each group into a new column. If the groups don't match, or the input is NA, the output will be NA.
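The capturing-group behaviour, including the NA cases, can be sketched on a plain data frame. The values in the type column are hypothetical and only mirror the style of the pasilla example; this uses tidyr::extract() directly rather than the SummarizedExperiment method.

```r
library(tidyr)

# Hypothetical library types: two that match, one that does not, one NA.
df <- data.frame(type = c("single_end", "paired_end", "other", NA))

# The single capturing group becomes the new column "sequencing".
# Rows that do not match the regex, or that are NA, yield NA.
out <- tidyr::extract(df, type, into = "sequencing", regex = "([a-z]*)_end")
```

Because remove = TRUE by default, the original type column is dropped from the result.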
## S3 method for class 'SummarizedExperiment'
extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)
data |
A data frame. |
col |
< |
into |
Names of new variables to create as character vector.
Use |
regex |
A string representing a regular expression used to extract the
desired values. There should be one group (defined by |
remove |
If |
convert |
If NB: this will cause string |
... |
Additional arguments passed on to methods. |
tidySummarizedExperiment
separate() to split up by a separator.
tidySummarizedExperiment::pasilla |>
    extract(type, into="sequencing", regex="([a-z]*)_end", convert=TRUE)
The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. To be retained, the row must produce a value of TRUE for all conditions. Note that when a condition evaluates to NA the row will be dropped, unlike base subsetting with [.
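The NA handling can be compared with base subsetting in a few lines. This is a sketch on a made-up data frame (columns id and mass are hypothetical), using the generic dplyr verb rather than the SummarizedExperiment method.

```r
library(dplyr)

# Hypothetical data with one missing value in the filtered column.
df <- data.frame(id = 1:3, mass = c(10, NA, 30))

# filter(): the row whose condition evaluates to NA is dropped.
kept <- df %>% filter(mass > 5)

# Base subsetting with [: the NA condition produces a row of NAs instead.
base_kept <- df[df$mass > 5, ]
```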
## S3 method for class 'SummarizedExperiment'
filter(.data, ..., .preserve = FALSE)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< |
.preserve |
Relevant when the |
The filter() function is used to subset the rows of .data, applying the expressions in ... to the column values to determine which rows should be retained. It can be applied to both grouped and ungrouped data (see group_by() and ungroup()). However, dplyr is not yet smart enough to optimise the filtering operation on grouped datasets that do not need grouped calculations. For this reason, filtering is often considerably faster on ungrouped data.
An object of the same type as .data. The output has the following properties:
Rows are a subset of the input, but appear in the same order.
Columns are not modified.
The number of groups may be reduced (if .preserve is not TRUE).
Data frame attributes are preserved.
There are many functions and operators that are useful when constructing the expressions used to filter the data:
Because filtering expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped filtering:
starwars %>% filter(mass > mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>% group_by(gender) %>% filter(mass > mean(mass, na.rm = TRUE))
In the ungrouped version, filter() compares the value of mass in each row to the global average (taken over the whole data set), keeping only the rows with mass greater than this global average. In contrast, the grouped version calculates the average mass separately for each gender group, and keeps rows with mass greater than the relevant within-gender average.
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Other single table verbs: arrange(), mutate(), reframe(), rename(), select(), slice(), summarise()
data(pasilla)
pasilla |> filter(.sample == "untrt1")

# Learn more in ?dplyr_tidy_eval
One of the main features of the tbl_df class is the printing:
Tibbles only print as many rows and columns as fit on one screen, supplemented by a summary of the remaining rows and columns.
Tibble reveals the type of each column, which keeps the user informed about whether a variable is, e.g., <chr> or <fct> (character versus factor). See vignette("types") for an overview of common type abbreviations.
Printing can be tweaked for a one-off call by calling print() explicitly and setting arguments like n and width. More persistent control is available by setting the options described in pillar::pillar_options. See also vignette("digits") for a comparison to base options, and vignette("numbers") that showcases num() and char() for creating columns with custom formatting options.
As of tibble 3.1.0, printing is handled entirely by the pillar package. If you implement a package that extends tibble, the printed output can be customized in various ways. See vignette("extending", package = "pillar") for details, and pillar::pillar_options for options that control the display in the console.
## S3 method for class 'SummarizedExperiment'
print(x, ..., n = NULL, width = NULL, n_extra = NULL)
x |
Object to format or print. |
... |
Passed on to |
n |
Number of rows to show. If |
width |
Width of text output to generate. This defaults to |
n_extra |
Number of extra columns to print abbreviated information for,
if the width is too small for the entire tibble. If |
Prints a message to the console describing the contents of the tidySummarizedExperiment.
data(pasilla)
print(pasilla)
Mutating joins add columns from y to x, matching observations based on the keys. There are four mutating joins: the inner join, and the three outer joins.
An inner_join() only keeps observations from x that have a matching key in y.
The most important property of an inner join is that unmatched rows in either input are not included in the result. This means that inner joins are often inappropriate for analysis, because it is too easy to lose observations.
The three outer joins keep observations that appear in at least one of the data frames:
A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x and y.
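The four join types can be compared side by side on two tiny tibbles. This sketch uses the generic dplyr joins on made-up tables (key, x_val, and y_val are hypothetical names), not the SummarizedExperiment methods.

```r
library(dplyr)

# Hypothetical tables that share only the key "b".
x <- tibble(key = c("a", "b"), x_val = 1:2)
y <- tibble(key = c("b", "c"), y_val = c(10, 20))

inner <- inner_join(x, y, by = "key")  # only the key present in both
left  <- left_join(x, y, by = "key")   # every row of x
right <- right_join(x, y, by = "key")  # every row of y
full  <- full_join(x, y, by = "key")   # every row of x and of y
```

Only the inner join loses rows from both inputs here: key "a" (from x) and key "c" (from y) are both absent from its result.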
## S3 method for class 'SummarizedExperiment'
full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
x, y |
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
by |
A join specification created with If To join on different variables between To join by multiple variables, use a
For simple equality joins, you can alternatively specify a character vector
of variable names to join by. For example, To perform a cross-join, generating all combinations of |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
... |
Other parameters passed onto methods. |
An object of the same type as x (including the same groups). The order of the rows and columns of x is preserved as much as possible. The output has the following properties:
The rows are affected by the join type.
inner_join() returns matched x rows.
left_join() returns all x rows.
right_join() returns matched x rows, followed by unmatched y rows.
full_join() returns all x rows, followed by unmatched y rows.
Output columns include all columns from x and all non-key columns from y. If keep = TRUE, the key columns from y are included as well.
If non-key columns in x and y have the same name, suffixes are added to disambiguate. If keep = TRUE and key columns in x and y have the same name, suffixes are added to disambiguate these as well.
If keep = FALSE, output columns included in by are coerced to their common type between x and y.
By default, dplyr guards against many-to-many relationships in equality joins by throwing a warning. These occur when both of the following are true:
A row in x matches multiple rows in y.
A row in y matches multiple rows in x.
This is typically surprising, as most joins involve a relationship of one-to-one, one-to-many, or many-to-one, and is often the result of an improperly specified join. Many-to-many relationships are particularly problematic because they can result in a Cartesian explosion of the number of rows returned from the join.
If a many-to-many relationship is expected, silence this warning by explicitly setting relationship = "many-to-many".
In production code, it is best to preemptively set relationship to whatever relationship you expect to exist between the keys of x and y, as this forces an error to occur immediately if the data doesn't align with your expectations.
Inequality joins typically result in many-to-many relationships by nature, so they don't warn on them by default, but you should still take extra care when specifying an inequality join, because they also have the capability to return a large number of rows.
Rolling joins don't warn on many-to-many relationships either, but many rolling joins follow a many-to-one relationship, so it is often useful to set relationship = "many-to-one" to enforce this.
Note that in SQL, most database providers won't let you specify a many-to-many relationship between two tables, instead requiring that you create a third junction table that results in two one-to-many relationships instead.
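The effect of the relationship argument can be sketched with two tables whose keys are duplicated on both sides. This assumes dplyr 1.1.0 or later, where the argument exists on the generic joins; whether the SummarizedExperiment methods forward it through ... is not stated here, so the sketch uses plain tibbles with hypothetical column names.

```r
library(dplyr)

# Hypothetical tables in which the key "a" appears twice on each side.
x <- tibble(key = c("a", "a"), x_val = 1:2)
y <- tibble(key = c("a", "a"), y_val = 3:4)

# Declaring the many-to-many relationship silences the warning;
# every x row is paired with every matching y row (2 x 2 = 4 rows).
res <- inner_join(x, y, by = "key", relationship = "many-to-many")

# Declaring a stricter expectation turns the mismatch into an error.
outcome <- tryCatch(
  {
    inner_join(x, y, by = "key", relationship = "many-to-one")
    "no error"
  },
  error = function(e) "error"
)
```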
These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Other joins: cross_join(), filter-joins, nest_join()
data(pasilla)

tt <- pasilla
tt |> full_join(tibble::tibble(condition="treated", dose=10))
ggplot from a tidySummarizedExperiment
ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
## S3 method for class 'SummarizedExperiment'
ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
data |
Default dataset to use for plot. If not already a data.frame,
will be converted to one by |
mapping |
Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot. |
... |
Other arguments passed on to methods. Not currently used. |
environment |
ggplot() is used to construct the initial plot object, and is almost always followed by a plus sign (+) to add components to the plot.
There are three common patterns used to invoke ggplot():
ggplot(data = df, mapping = aes(x, y, other aesthetics))
ggplot(data = df)
ggplot()
The first pattern is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used when adding a layer using data from another data frame.
The second pattern specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly for the plot, but the aesthetics vary from one layer to another.
The third pattern initializes a skeleton ggplot object, which is fleshed out as layers are added. This is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.
The data = and mapping = specifications in the arguments are optional (and are often omitted in practice), so long as the data and the mapping values are passed into the function in the right order. In the examples below, however, they are left in place for clarity.
ggplot
The first steps chapter of the online ggplot2 book.
library(ggplot2)
data(pasilla)

pasilla %>%
    ggplot(aes(.sample, counts)) +
    geom_boxplot()
Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping.
## S3 method for class 'SummarizedExperiment'
group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
In |
.add |
When This argument was previously called |
.drop |
Drop groups formed by factor levels that don't appear in the
data? The default is |
A grouped data frame with class grouped_df, unless the combination of ... and .add yields an empty set of grouping columns, in which case a tibble will be returned.
These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Currently, group_by() internally orders the groups in ascending order. This results in ordered output from functions that aggregate groups, such as summarise().
When used as grouping columns, character vectors are ordered in the C locale for performance and reproducibility across R sessions. If the resulting ordering of your grouped operation matters and is dependent on the locale, you should follow up the grouped operation with an explicit call to arrange() and set the .locale argument. For example:
data %>% group_by(chr) %>% summarise(avg = mean(x)) %>% arrange(chr, .locale = "en")
This is often useful as a preliminary step before generating content intended for humans, such as an HTML table.
Prior to dplyr 1.1.0, character vector grouping columns were ordered in the system locale. If you need to temporarily revert to this behavior, you can set the global option dplyr.legacy_locale to TRUE, but this should be used sparingly and you should expect this option to be removed in a future version of dplyr. It is better to update existing code to explicitly call arrange(.locale = ) instead. Note that setting dplyr.legacy_locale will also force calls to arrange() to use the system locale.
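The C-locale ordering of groups can be seen with a grouping column that mixes upper and lower case. This sketch assumes dplyr 1.1.0 or later with the dplyr.legacy_locale option unset, and uses a plain tibble with hypothetical columns chr and x.

```r
library(dplyr)

# Hypothetical data: group labels that differ in case.
df <- tibble(chr = c("b", "B", "a"), x = c(1, 2, 3))

# Groups come back in C-locale order, where upper-case letters sort
# before lower-case ones, regardless of the system locale.
out <- df %>%
  group_by(chr) %>%
  summarise(avg = mean(x))
```

A locale-aware collation such as English would instead be expected to place "a" first, which is why the text above recommends a follow-up arrange() with .locale when the order matters for presentation.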
Other grouping functions: group_map(), group_nest(), group_split(), group_trim()
data(pasilla)
pasilla |> group_by(.sample)
group_split() works like base::split() but:
It uses the grouping structure from group_by() and therefore is subject to the data mask.
It does not name the elements of the list based on the grouping as this only works well for a single character grouping variable. Instead, use group_keys() to access a data frame that defines the groups.
group_split() is primarily designed to work with grouped data frames. You can pass ... to group and split an ungrouped data frame, but this is generally not very useful as you won't have easy access to the group metadata.
## S3 method for class 'SummarizedExperiment'
group_split(.tbl, ..., .keep = TRUE)
.tbl |
A tbl. |
... |
If |
.keep |
Should the grouping columns be kept? |
A list of tibbles. Each tibble contains the rows of .tbl for the associated group and all the columns, including the grouping variables. Note that this returns a list_of which is slightly stricter than a simple list but is useful for representing lists where every element has the same type.
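The unnamed list and its companion group_keys() can be sketched on a small tibble. The columns condition and counts are hypothetical stand-ins for the pasilla columns, and the generic dplyr functions are used rather than the SummarizedExperiment method.

```r
library(dplyr)

# Hypothetical data with two conditions.
df <- tibble(
  condition = c("treated", "untreated", "treated"),
  counts    = c(0, 4, 9)
)

grouped <- df %>% group_by(condition)

# One tibble per group; the list elements carry no names.
parts <- grouped %>% group_split()

# group_keys() recovers which group each list element corresponds to.
keys <- grouped %>% group_keys()
```

The i-th element of parts corresponds to the i-th row of keys.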
group_split() is not stable because you can achieve very similar results by manipulating the nested column returned from tidyr::nest(.by =). That also retains the group keys all within a single data structure. group_split() may be deprecated in the future.
Other grouping functions: group_by(), group_map(), group_nest(), group_trim()
data(pasilla, package = "tidySummarizedExperiment")
pasilla |> group_split(condition)
pasilla |> group_split(counts > 0)
pasilla |> group_split(condition, counts > 0)
## S3 method for class 'SummarizedExperiment'
inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
data(pasilla)

tt <- pasilla
tt |> inner_join(tt |>
    distinct(condition) |>
    mutate(new_column=1:2) |>
    slice(1))
## S3 method for class 'SummarizedExperiment'
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
x , y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
by |
A join specification created with If To join on different variables between To join by multiple variables, use a
For simple equality joins, you can alternatively specify a character vector
of variable names to join by. For example, To perform a cross-join, generating all combinations of |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
... |
Other parameters passed onto methods. |
An object of the same type as x
(including the same groups). The order of
the rows and columns of x
is preserved as much as possible. The output has
the following properties:
The rows are affect by the join type.
inner_join()
returns matched x
rows.
left_join()
returns all x
rows.
right_join()
returns matched of x
rows, followed by unmatched y
rows.
full_join()
returns all x
rows, followed by unmatched y
rows.
Output columns include all columns from x
and all non-key columns from
y
. If keep = TRUE
, the key columns from y
are included as well.
If non-key columns in x
and y
have the same name, suffix
es are added
to disambiguate. If keep = TRUE
and key columns in x
and y
have
the same name, suffix
es are added to disambiguate these as well.
If keep = FALSE
, output columns included in by
are coerced to their
common type between x
and y
.
By default, dplyr guards against many-to-many relationships in equality joins by throwing a warning. These occur when both of the following are true:
A row in x matches multiple rows in y.
A row in y matches multiple rows in x.
This is typically surprising, as most joins involve a relationship of one-to-one, one-to-many, or many-to-one, and is often the result of an improperly specified join. Many-to-many relationships are particularly problematic because they can result in a Cartesian explosion of the number of rows returned from the join.
If a many-to-many relationship is expected, silence this warning by explicitly setting relationship = "many-to-many".
In production code, it is best to preemptively set relationship to whatever relationship you expect to exist between the keys of x and y, as this forces an error to occur immediately if the data doesn't align with your expectations.
Inequality joins typically result in many-to-many relationships by nature, so they don't warn on them by default, but you should still take extra care when specifying an inequality join, because they also have the capability to return a large number of rows.
Rolling joins don't warn on many-to-many relationships either, but many rolling joins follow a many-to-one relationship, so it is often useful to set relationship = "many-to-one" to enforce this.
Note that in SQL, most database providers won't let you specify a many-to-many relationship between two tables, instead requiring that you create a third junction table that results in two one-to-many relationships instead.
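The effect of the relationship argument can be sketched as follows. This assumes dplyr 1.1.0 or later (where the argument is available) and uses plain tibbles; the tables and column names are hypothetical.

```r
library(dplyr)

# Hypothetical tables in which key 1 is duplicated on both sides
x <- tibble(key = c(1, 1, 2), x_val = c("a", "b", "c"))
y <- tibble(key = c(1, 1, 2), y_val = c("p", "q", "r"))

# Declaring the many-to-many relationship silences the warning.
# Key 1 contributes 2 x 2 rows and key 2 contributes 1 row.
mm <- left_join(x, y, by = "key", relationship = "many-to-many")

# Declaring a stricter relationship turns the mismatch into an error,
# because a row in x matches more than one row in y.
strict <- tryCatch(
  left_join(x, y, by = "key", relationship = "many-to-one"),
  error = function(e) "error"
)
```

Setting the expected relationship up front is the defensive option: the join fails immediately instead of silently multiplying rows.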
These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Methods available in currently loaded packages:
inner_join(): no methods found.
left_join(): no methods found.
right_join(): no methods found.
full_join(): no methods found.
Other joins: cross_join(), filter-joins, nest_join()
data(pasilla)
tt <- pasilla
tt |> left_join(tt |> distinct(condition) |> mutate(new_column=1:2))
mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL).
## S3 method for class 'SummarizedExperiment'
mutate(.data, ...)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< The value can be:
|
An object of the same type as .data. The output has the following properties:
Columns from .data will be preserved according to the .keep argument.
Existing columns that are modified by ... will always be returned in their original location.
New columns created through ... will be placed according to the .before and .after arguments.
The number of rows is not affected.
Columns given the value NULL will be removed.
Groups will be recomputed if a grouping variable is mutated.
Data frame attributes are preserved.
Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.
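The same contrast is easier to check by hand on a four-row table. This is a small sketch on a hypothetical tibble (not the starwars data), chosen so that the group means are round numbers.

```r
library(dplyr)

# Hypothetical data: two species with two individuals each
df <- tibble(
  species = c("a", "a", "b", "b"),
  mass    = c(10, 30, 100, 300)
)

# Ungrouped: every mass is divided by the global mean (110)
ungrouped <- df |>
  mutate(mass_norm = mass / mean(mass))

# Grouped: each mass is divided by its own species mean (20 and 200)
grouped <- df |>
  group_by(species) |>
  mutate(mass_norm = mass / mean(mass)) |>
  ungroup()

grouped$mass_norm
#> [1] 0.5 1.5 0.5 1.5
```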
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Methods available in currently loaded packages: no methods found.
Other single table verbs: rename(), slice(), summarise()
data(pasilla)
pasilla |> mutate(logcounts=log2(counts))
Allows a mutate() call on the features (rowData) of a SummarizedExperiment.
mutate_features(.data, ...)
.data |
a SummarizedExperiment |
... |
extra arguments passed to dplyr::mutate |
a SummarizedExperiment with modified rowData
Allows a mutate() call on the samples (colData) of a SummarizedExperiment.
mutate_samples(.data, ...)
.data |
a SummarizedExperiment |
... |
extra arguments passed to dplyr::mutate |
a SummarizedExperiment with modified colData
Nesting creates a list-column of data frames; unnesting flattens it back out into regular columns. Nesting is implicitly a summarising operation: you get one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.
Learn more in vignette("nest").
## S3 method for class 'SummarizedExperiment'
nest(.data, ..., .names_sep = NULL)
.data |
A data frame. |
... |
< Specified using name-variable pairs of the form
If not supplied, then :
previously you could write |
.names_sep |
If |
If neither ... nor .by are supplied, nest() will nest all variables, and will use the column name supplied through .key.
tidySummarizedExperiment_nested
tidyr 1.0.0 introduced a new syntax for nest() and unnest() that's designed to be more similar to other functions. Converting to the new syntax should be straightforward (guided by the message you'll receive) but if you just need to run an old analysis, you can easily revert to the previous behaviour using nest_legacy() and unnest_legacy() as follows:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy
df %>% nest(data = c(x, y)) specifies the columns to be nested; i.e. the columns that will appear in the inner data frame. df %>% nest(.by = c(x, y)) specifies the columns to nest by; i.e. the columns that will remain in the outer data frame. An alternative way to achieve the latter is to nest() a grouped data frame created by dplyr::group_by(). The grouping variables remain in the outer data frame and the others are nested. The result preserves the grouping of the input.
Variables supplied to nest() will override grouping variables so that df %>% group_by(x, y) %>% nest(data = !z) will be equivalent to df %>% nest(data = !z).
You can't supply .by with a grouped data frame, as the groups already represent what you are nesting by.
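The three ways of choosing what gets nested can be compared side by side. This sketch assumes tidyr 1.3.0 or later (for the .by argument) and a hypothetical plain tibble; it does not exercise the SummarizedExperiment method.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: g identifies the group, x and y are measurements
df <- tibble(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# Name the columns that go into the inner data frames
nested_cols <- df |> nest(data = c(x, y))

# Name the columns that stay in the outer data frame
nested_by <- df |> nest(.by = g)

# Nest a grouped data frame: the grouping variable stays outside
# and the result keeps the grouping of the input
nested_grouped <- df |> group_by(g) |> nest()
```

All three calls return one row per value of `g`, with a list-column named `data` holding the remaining columns.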
tidySummarizedExperiment::pasilla |> nest(data=-condition)
A SummarizedExperiment dataset containing the transcriptome information for Drosophila melanogaster.
data(pasilla)
A SummarizedExperiment containing 14599 features and 7 biological replicates.
https://bioconductor.org/packages/release/data/experiment/html/pasilla.html
pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider().
Learn more in vignette("pivot").
## S3 method for class 'SummarizedExperiment'
pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL
)
data |
A data frame to pivot. |
cols |
< |
... |
Additional arguments passed on to methods. |
cols_vary |
When pivoting
|
names_to |
A character vector specifying the new column or columns to
create from the information stored in the column names of
|
names_prefix |
A regular expression used to remove matching text from the start of each variable name. |
names_sep , names_pattern
|
If
If these arguments do not give you enough control, use
|
names_ptypes , values_ptypes
|
Optionally, a list of column name-prototype
pairs. Alternatively, a single empty prototype can be supplied, which will
be applied to all columns. A prototype (or ptype for short) is a
zero-length vector (like |
names_transform , values_transform
|
Optionally, a list of column
name-function pairs. Alternatively, a single function can be supplied,
which will be applied to all columns. Use these arguments if you need to
change the types of specific columns. For example, If not specified, the type of the columns generated from |
names_repair |
What happens if the output has invalid column names?
The default, |
values_to |
A string specifying the name of the column to create
from the data stored in cell values. If |
values_drop_na |
If |
pivot_longer() is an updated approach to gather(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_longer() for new code; gather() isn't going away but is no longer under active development.
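As a rough illustration of how names_to, values_to and values_drop_na interact, the following sketch lengthens a small count table. The tibble and its column names are hypothetical, and the example uses a plain data frame rather than a SummarizedExperiment.

```r
library(tidyr)

# Hypothetical wide table: one row per feature, one column per sample
wide <- tibble(
  feature = c("g1", "g2"),
  s1 = c(5, 0),
  s2 = c(7, NA)
)

long <- wide |>
  pivot_longer(
    cols = c(s1, s2),
    names_to = "sample",    # former column names go here
    values_to = "counts",   # former cell values go here
    values_drop_na = TRUE   # the g2/s2 cell is NA, so that row is dropped
  )
```

The result has three rows (g1/s1, g1/s2, g2/s1) and the columns `feature`, `sample` and `counts`.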
tidySummarizedExperiment
# See vignette("pivot") for examples and explanation
library(dplyr)
tidySummarizedExperiment::pasilla %>%
  pivot_longer(c(condition, type), names_to="name", values_to="value")
pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().
Learn more in vignette("pivot").
## S3 method for class 'SummarizedExperiment'
pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL
)
data |
A data frame to pivot. |
... |
Additional arguments passed on to methods. |
id_cols |
< Defaults to all columns in |
id_expand |
Should the values in the |
names_from , values_from
|
< If |
names_prefix |
String added to the start of every variable name. This is
particularly useful if |
names_sep |
If |
names_glue |
Instead of |
names_sort |
Should the column names be sorted? If |
names_vary |
When
|
names_expand |
Should the values in the |
names_repair |
What happens if the output has invalid column names?
The default, |
values_fill |
Optionally, a (scalar) value that specifies what each
This can be a named list if you want to apply different fill values to different value columns. |
values_fn |
Optionally, a function applied to the value in each cell
in the output. You will typically use this when the combination of
This can be a named list if you want to apply different aggregations
to different |
unused_fn |
Optionally, a function applied to summarize the values from
the unused columns (i.e. columns not identified by The default drops all unused columns from the result. This can be a named list if you want to apply different aggregations to different unused columns.
This is similar to grouping by the |
pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_wider() for new code; spread() isn't going away but is no longer under active development.
tidySummarizedExperiment
pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivoting specification.
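A minimal sketch of names_from, values_from and values_fill on a hypothetical long table may help here. It uses a plain tibble, so it shows the general tidyr behaviour rather than anything specific to the SummarizedExperiment method.

```r
library(tidyr)

# Hypothetical long table: the g2/s2 combination is missing
long <- tibble(
  feature = c("g1", "g1", "g2"),
  sample  = c("s1", "s2", "s1"),
  counts  = c(5, 7, 0)
)

wide <- long |>
  pivot_wider(
    names_from = sample,   # one output column per sample
    values_from = counts,  # cells are filled from counts
    values_fill = 0        # missing combinations become 0 instead of NA
  )
```

Here `wide` has one row per feature and the columns `feature`, `s1` and `s2`, with the absent g2/s2 cell filled with 0.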
# See vignette("pivot") for examples and explanation
library(dplyr)
tidySummarizedExperiment::pasilla %>%
  pivot_wider(names_from=feature, values_from=counts)
This function maps R objects to plotly.js, an (MIT licensed) web-based interactive charting library. It provides abstractions for doing common things (e.g. mapping data values to fill colors (via color) or creating animations (via frame)) and sets some different defaults to make the interface feel more 'R-like' (i.e., closer to plot() and ggplot2::qplot()).
## S3 method for class 'tbl_df'
plot_ly(
  data = data.frame(),
  ...,
  type = NULL,
  name = NULL,
  color = NULL,
  colors = NULL,
  alpha = NULL,
  stroke = NULL,
  strokes = NULL,
  alpha_stroke = 1,
  size = NULL,
  sizes = c(10, 100),
  span = NULL,
  spans = c(1, 20),
  symbol = NULL,
  symbols = NULL,
  linetype = NULL,
  linetypes = NULL,
  split = NULL,
  frame = NULL,
  width = NULL,
  height = NULL,
  source = "A"
)
## S3 method for class 'SummarizedExperiment'
plot_ly(
  data = data.frame(),
  ...,
  type = NULL,
  name = NULL,
  color = NULL,
  colors = NULL,
  alpha = NULL,
  stroke = NULL,
  strokes = NULL,
  alpha_stroke = 1,
  size = NULL,
  sizes = c(10, 100),
  span = NULL,
  spans = c(1, 20),
  symbol = NULL,
  symbols = NULL,
  linetype = NULL,
  linetypes = NULL,
  split = NULL,
  frame = NULL,
  width = NULL,
  height = NULL,
  source = "A"
)
data |
A data frame (optional) or crosstalk::SharedData object. |
... |
Arguments (i.e., attributes) passed along to the trace |
type |
A character string specifying the trace type (e.g. |
name |
Values mapped to the trace's name attribute. Since a trace can
only have one name, this argument acts very much like |
color |
Values mapped to relevant 'fill-color' attribute(s)
(e.g. fillcolor,
marker.color,
textfont.color, etc.).
The mapping from data values to color codes may be controlled using
|
colors |
Either a colorbrewer2.org palette name (e.g. "YlOrRd" or "Blues"),
or a vector of colors to interpolate in hexadecimal "#RRGGBB" format,
or a color interpolation function like |
alpha |
A number between 0 and 1 specifying the alpha channel applied to |
stroke |
Similar to |
strokes |
Similar to |
alpha_stroke |
Similar to |
size |
(Numeric) values mapped to relevant 'fill-size' attribute(s)
(e.g., marker.size,
textfont.size,
and error_x.width).
The mapping from data values to symbols may be controlled using
|
sizes |
A numeric vector of length 2 used to scale |
span |
(Numeric) values mapped to relevant 'stroke-size' attribute(s)
(e.g.,
marker.line.width,
line.width for filled polygons,
and error_x.thickness)
The mapping from data values to symbols may be controlled using
|
spans |
A numeric vector of length 2 used to scale |
symbol |
(Discrete) values mapped to marker.symbol.
The mapping from data values to symbols may be controlled using
|
symbols |
A character vector of pch values or symbol names. |
linetype |
(Discrete) values mapped to line.dash.
The mapping from data values to symbols may be controlled using
|
linetypes |
A character vector of |
split |
(Discrete) values used to create multiple traces (one trace per value). |
frame |
(Discrete) values used to create animation frames. |
width |
Width in pixels (optional, defaults to automatic sizing). |
height |
Height in pixels (optional, defaults to automatic sizing). |
source |
a character string of length 1. Match the value of this string
with the source argument in |
Unless type is specified, this function just initiates a plotly object with 'global' attributes that are passed onto downstream uses of add_trace() (or similar). A formula must always be used when referencing column name(s) in data (e.g. plot_ly(mtcars, x = ~wt)). Formulas are optional when supplying values directly, but they do help inform default axis/scale titles (e.g., plot_ly(x = mtcars$wt) vs plot_ly(x = ~mtcars$wt)).
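The two ways of passing values can be sketched with the built-in mtcars data. This assumes the plotly package is installed and uses a plain data frame; the choice of a scatter trace with markers is only for illustration.

```r
library(plotly)

# Column names in `data` are referenced through formulas
p_formula <- plot_ly(
  mtcars,
  x = ~wt, y = ~mpg,
  type = "scatter", mode = "markers"
)

# Values can also be supplied directly, without `data` or formulas
p_values <- plot_ly(
  x = mtcars$wt, y = mtcars$mpg,
  type = "scatter", mode = "markers"
)
```

Both calls return a plotly object. According to the text above, the formula version additionally helps plotly pick default axis titles.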
plotly
Carson Sievert
https://plotly-r.com/overview.html
For initializing a plotly-geo object: plot_geo()
For initializing a plotly-mapbox object: plot_mapbox()
For translating a ggplot2 object to a plotly object: ggplotly()
For modifying any plotly object: layout(), add_trace(), style()
For linked brushing: highlight()
For arranging multiple plots: subplot(), crosstalk::bscols()
For inspecting plotly objects: plotly_json()
For quick, accurate, and searchable plotly.js reference: schema()
data(se)
se |> plot_ly(x = ~counts)
pull() is similar to $. It's mostly useful because it looks a little nicer in pipes, it also works with remote data frames, and it can optionally name the output.
## S3 method for class 'SummarizedExperiment'
pull(.data, var = -1, name = NULL, ...)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
var |
A variable specified as:
The default returns the last column (on the assumption that's the column you've created most recently). This argument is taken by expression and supports quasiquotation (you can unquote column names and column locations). |
name |
An optional parameter that specifies the column to be used
as names for a named vector. Specified in a similar manner as |
... |
For use by methods. |
A vector the same size as .data.
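The var and name arguments can be sketched on a small table. The tibble below is hypothetical and the example uses the data frame method of pull(), not the SummarizedExperiment one.

```r
library(dplyr)

# Hypothetical data
df <- tibble(feature = c("g1", "g2", "g3"), counts = c(5, 0, 12))

# Default var = -1 returns the last column
last_col_values <- pull(df)

# A positive integer selects a column by position from the left
first_col_values <- pull(df, 1)

# `name` turns another column into the names of the output vector
named_counts <- pull(df, counts, name = feature)
named_counts
#> g1 g2 g3 
#>  5  0 12
```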
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
data(pasilla)
pasilla |> pull(feature)
rename() changes the names of individual variables using new_name = old_name syntax; rename_with() renames columns using a function.
## S3 method for class 'SummarizedExperiment'
rename(.data, ...)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
For For |
An object of the same type as .data. The output has the following properties:
Rows are not affected.
Column names are changed; column order is preserved.
Data frame attributes are preserved.
Groups are updated to reflect new names.
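The difference between rename() and rename_with() can be sketched as follows, on a hypothetical tibble whose column names merely echo the pasilla sample annotation.

```r
library(dplyr)

# Hypothetical data
df <- tibble(
  condition = c("treated", "untreated"),
  type      = c("single", "paired")
)

# rename(): new_name = old_name
renamed <- df |> rename(cond = condition)

# rename_with(): apply a function to every column name ...
all_upper <- df |> rename_with(toupper)

# ... or only to a selection of columns
some_upper <- df |> rename_with(toupper, .cols = starts_with("cond"))
```

In all three results the rows and the column order are unchanged; only the names differ.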
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Other single table verbs: mutate(), slice(), summarise()
data(pasilla)
pasilla |> rename(cond=condition)
Mutating joins add columns from y to x, matching observations based on the keys. There are four mutating joins: the inner join, and the three outer joins.
An inner_join() only keeps observations from x that have a matching key in y.
The most important property of an inner join is that unmatched rows in either input are not included in the result. This means that generally inner joins are not appropriate in most analyses, because it is too easy to lose observations.
The three outer joins keep observations that appear in at least one of the data frames:
A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x and y.
## S3 method for class 'SummarizedExperiment'
right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ...)
x , y
|
A pair of data frames, data frame extensions (e.g. a tibble), or lazy data frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
by |
A join specification created with If To join on different variables between To join by multiple variables, use a
For simple equality joins, you can alternatively specify a character vector
of variable names to join by. For example, To perform a cross-join, generating all combinations of |
copy |
If |
suffix |
If there are non-joined duplicate variables in |
... |
Other parameters passed onto methods. |
An object of the same type as x (including the same groups). The order of the rows and columns of x is preserved as much as possible. The output has the following properties:
The rows are affected by the join type.
inner_join() returns matched x rows.
left_join() returns all x rows.
right_join() returns matched x rows, followed by unmatched y rows.
full_join() returns all x rows, followed by unmatched y rows.
Output columns include all columns from x and all non-key columns from y. If keep = TRUE, the key columns from y are included as well.
If non-key columns in x and y have the same name, suffixes are added to disambiguate. If keep = TRUE and key columns in x and y have the same name, suffixes are added to disambiguate these as well.
If keep = FALSE, output columns included in by are coerced to their common type between x and y.
By default, dplyr guards against many-to-many relationships in equality joins by throwing a warning. These occur when both of the following are true:
A row in x matches multiple rows in y.
A row in y matches multiple rows in x.
This is typically surprising, as most joins involve a relationship of one-to-one, one-to-many, or many-to-one, and is often the result of an improperly specified join. Many-to-many relationships are particularly problematic because they can result in a Cartesian explosion of the number of rows returned from the join.
If a many-to-many relationship is expected, silence this warning by explicitly setting relationship = "many-to-many".
In production code, it is best to preemptively set relationship to whatever relationship you expect to exist between the keys of x and y, as this forces an error to occur immediately if the data doesn't align with your expectations.
Inequality joins typically result in many-to-many relationships by nature, so they don't warn on them by default, but you should still take extra care when specifying an inequality join, because they also have the capability to return a large number of rows.
Rolling joins don't warn on many-to-many relationships either, but many rolling joins follow a many-to-one relationship, so it is often useful to set relationship = "many-to-one" to enforce this.
Note that in SQL, most database providers won't let you specify a many-to-many relationship between two tables, instead requiring that you create a third junction table that results in two one-to-many relationships instead.
These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Methods available in currently loaded packages:
inner_join(): no methods found.
left_join(): no methods found.
right_join(): no methods found.
full_join(): no methods found.
Other joins: cross_join(), filter-joins, nest_join()
data(pasilla)
tt <- pasilla
tt |> right_join(tt |> distinct(condition) |> mutate(new_column=1:2) |> slice(1))
rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn't exist.
Most dplyr verbs preserve row-wise grouping. The exception is summarise(), which returns a grouped_df. You can explicitly ungroup with ungroup() or as_tibble(), or convert to a grouped_df with group_by().
## S3 method for class 'SummarizedExperiment'
rowwise(data, ...)
data |
Input data frame. |
... |
< NB: unlike |
A row-wise data frame with class rowwise_df. Note that a rowwise_df is implicitly grouped by row, but is not a grouped_df.
Because a rowwise has exactly one row per group it offers a small convenience for working with list-columns. Normally, summarise() and mutate() extract a group's worth of data with [. But when you index a list in this way, you get back another list. When you're working with a rowwise tibble, then dplyr will use [[ instead of [ to make your life a little easier.
nest_by() for a convenient way of creating rowwise data frames with nested data.
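The list-column convenience described above can be sketched on a plain tibble. The data are hypothetical, and the example uses the data frame method of rowwise() rather than the SummarizedExperiment one.

```r
library(dplyr)

# Hypothetical data: `vals` is a list-column holding one numeric vector per row
df <- tibble(
  id   = 1:2,
  vals = list(c(1, 2, 3), c(10, 20))
)

# Inside a rowwise data frame, `vals` refers to the single element of
# the current row (extracted with [[), so sum() and length() see a
# numeric vector rather than a list
out <- df |>
  rowwise() |>
  mutate(total = sum(vals), n = length(vals)) |>
  ungroup()

out$total
#> [1]  6 30
```

Without rowwise(), sum(vals) would be called on the whole list and fail, which is the situation the [[ behaviour is meant to avoid.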
# TODO
sample_n() and sample_frac() have been superseded in favour of slice_sample(). While they will not be deprecated in the near future, retirement means that we will only perform critical bug fixes, so we recommend moving to the newer alternative.
These functions were superseded because we realised it was more convenient to have two mutually exclusive arguments to one function, rather than two separate functions. This also made it possible to clean up a few other smaller design issues with sample_n()/sample_frac():
The connection to slice() was not obvious.
The name of the first argument, tbl, is inconsistent with other single table verbs which use .data.
The size argument uses tidy evaluation, which is surprising and undocumented.
It was easier to remove the deprecated .env argument.
... was in a suboptimal position.
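The recommended replacement can be sketched as below, where n and prop are the two mutually exclusive arguments mentioned above. The example assumes a plain tibble with hypothetical contents; whether slice_sample() is available for SummarizedExperiment objects is not shown here.

```r
library(dplyr)

set.seed(42)  # only to make the sketch reproducible

# Hypothetical data: 100 rows
df <- tibble(id = 1:100)

# Counterpart of sample_n(df, 10)
by_n <- df |> slice_sample(n = 10)

# Counterpart of sample_frac(df, 0.1)
by_prop <- df |> slice_sample(prop = 0.1)

# Sampling with replacement can return more rows than the input has
with_replacement <- df |> slice_sample(n = 150, replace = TRUE)
```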
## S3 method for class 'SummarizedExperiment'
sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, ...)
## S3 method for class 'SummarizedExperiment'
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL, ...)
tbl |
A data.frame. |
size |
< |
replace |
Sample with or without replacement? |
weight |
< |
.env |
DEPRECATED. |
... |
ignored |
tidySummarizedExperiment
data(pasilla)
pasilla |> sample_n(50)
pasilla |> sample_frac(0.1)
A SummarizedExperiment dataset containing the transcriptome information for Drosophila melanogaster.
data(se)
A SummarizedExperiment containing 14599 features and 7 biological replicates.
https://bioconductor.org/packages/release/data/experiment/html/pasilla.html
Select (and optionally rename) variables in a data frame, using a concise mini-language that makes it easy to refer to variables based on their name (e.g. a:f selects all columns from a on the left to f on the right) or type (e.g. where(is.numeric) selects all numeric columns).
Tidyverse selections implement a dialect of R where operators make it easy to select variables:
: for selecting a range of consecutive variables.
! for taking the complement of a set of variables.
& and | for selecting the intersection or the union of two sets of variables.
c() for combining selections.
In addition, you can use selection helpers. Some helpers select specific columns:
everything(): Matches all variables.
last_col(): Select last variable, possibly with an offset.
group_cols(): Select all grouping columns.
Other helpers select variables by matching patterns in their names:
starts_with(): Starts with a prefix.
ends_with(): Ends with a suffix.
contains(): Contains a literal string.
matches(): Matches a regular expression.
num_range(): Matches a numerical range like x01, x02, x03.
Or from variables stored in a character vector:
all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
Or using a predicate function:
where(): Applies a function to all variables and selects those for which the function returns TRUE.
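The helpers that work from a character vector or a predicate can be sketched on a small table. The tibble and the vector `wanted` are hypothetical, and the example uses a plain data frame.

```r
library(dplyr)

# Hypothetical data
df <- tibble(
  sample    = c("s1", "s2"),
  counts    = c(5, 7),
  logcounts = log2(c(5, 7))
)

# A character vector that includes a name not present in df
wanted <- c("sample", "counts", "missing_col")

# any_of() silently skips names that do not exist
lenient <- df |> select(any_of(wanted))

# all_of() requires every name to be present, so this call errors
strict <- tryCatch(
  df |> select(all_of(wanted)),
  error = function(e) "error"
)

# where() keeps the columns for which the predicate returns TRUE
numeric_only <- df |> select(where(is.numeric))
```

all_of() is the safer choice when a missing column indicates a real problem; any_of() suits optional columns.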
## S3 method for class 'SummarizedExperiment'
select(.data, ...)
.data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... |
< |
An object of the same type as .data. The output has the following properties:
Rows are not affected.
Output columns are a subset of input columns, potentially with a different order. Columns will be renamed if new_name = old_name form is used.
Data frame attributes are preserved.
Groups are maintained; you can't select off grouping variables.
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Here we show the usage for the basic selection operators. See the
specific help pages to learn about helpers like starts_with()
.
The selection language can be used in functions like
dplyr::select()
or tidyr::pivot_longer()
. Let's first attach
the tidyverse:
library(tidyverse) # For better printing iris <- as_tibble(iris)
Select variables by name:
starwars %>% select(height) #> # A tibble: 87 x 1 #> height #> <int> #> 1 172 #> 2 167 #> 3 96 #> 4 202 #> # i 83 more rows iris %>% pivot_longer(Sepal.Length) #> # A tibble: 150 x 6 #> Sepal.Width Petal.Length Petal.Width Species name value #> <dbl> <dbl> <dbl> <fct> <chr> <dbl> #> 1 3.5 1.4 0.2 setosa Sepal.Length 5.1 #> 2 3 1.4 0.2 setosa Sepal.Length 4.9 #> 3 3.2 1.3 0.2 setosa Sepal.Length 4.7 #> 4 3.1 1.5 0.2 setosa Sepal.Length 4.6 #> # i 146 more rows
Select multiple variables by separating them with commas. Note how the order of columns is determined by the order of inputs:
starwars %>% select(homeworld, height, mass) #> # A tibble: 87 x 3 #> homeworld height mass #> <chr> <int> <dbl> #> 1 Tatooine 172 77 #> 2 Tatooine 167 75 #> 3 Naboo 96 32 #> 4 Tatooine 202 136 #> # i 83 more rows
Functions like tidyr::pivot_longer()
don't take variables with
dots. In this case use c()
to select multiple variables:
iris %>% pivot_longer(c(Sepal.Length, Petal.Length)) #> # A tibble: 300 x 5 #> Sepal.Width Petal.Width Species name value #> <dbl> <dbl> <fct> <chr> <dbl> #> 1 3.5 0.2 setosa Sepal.Length 5.1 #> 2 3.5 0.2 setosa Petal.Length 1.4 #> 3 3 0.2 setosa Sepal.Length 4.9 #> 4 3 0.2 setosa Petal.Length 1.4 #> # i 296 more rows
The :
operator selects a range of consecutive variables:
starwars %>% select(name:mass) #> # A tibble: 87 x 3 #> name height mass #> <chr> <int> <dbl> #> 1 Luke Skywalker 172 77 #> 2 C-3PO 167 75 #> 3 R2-D2 96 32 #> 4 Darth Vader 202 136 #> # i 83 more rows
The !
operator negates a selection:
starwars %>% select(!(name:mass)) #> # A tibble: 87 x 11 #> hair_color skin_color eye_color birth_year sex gender homeworld species #> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> #> 1 blond fair blue 19 male masculine Tatooine Human #> 2 <NA> gold yellow 112 none masculine Tatooine Droid #> 3 <NA> white, blue red 33 none masculine Naboo Droid #> 4 none white yellow 41.9 male masculine Tatooine Human #> # i 83 more rows #> # i 3 more variables: films <list>, vehicles <list>, starships <list> iris %>% select(!c(Sepal.Length, Petal.Length)) #> # A tibble: 150 x 3 #> Sepal.Width Petal.Width Species #> <dbl> <dbl> <fct> #> 1 3.5 0.2 setosa #> 2 3 0.2 setosa #> 3 3.2 0.2 setosa #> 4 3.1 0.2 setosa #> # i 146 more rows iris %>% select(!ends_with("Width")) #> # A tibble: 150 x 3 #> Sepal.Length Petal.Length Species #> <dbl> <dbl> <fct> #> 1 5.1 1.4 setosa #> 2 4.9 1.4 setosa #> 3 4.7 1.3 setosa #> 4 4.6 1.5 setosa #> # i 146 more rows
& and | take the intersection or the union of two selections:
iris %>% select(starts_with("Petal") & ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Width
#>         <dbl>
#> 1         0.2
#> 2         0.2
#> 3         0.2
#> 4         0.2
#> # i 146 more rows

iris %>% select(starts_with("Petal") | ends_with("Width"))
#> # A tibble: 150 x 3
#>   Petal.Length Petal.Width Sepal.Width
#>          <dbl>       <dbl>       <dbl>
#> 1          1.4         0.2         3.5
#> 2          1.4         0.2         3
#> 3          1.3         0.2         3.2
#> 4          1.5         0.2         3.1
#> # i 146 more rows
To take the difference between two selections, combine the & and ! operators:
iris %>% select(starts_with("Petal") & !ends_with("Width"))
#> # A tibble: 150 x 1
#>   Petal.Length
#>          <dbl>
#> 1          1.4
#> 2          1.4
#> 3          1.3
#> 4          1.5
#> # i 146 more rows
Other single table verbs: arrange(), filter(), mutate(), reframe(), rename(), slice(), summarise()
data(pasilla)
pasilla |> select(.sample, .feature, counts)
separate() has been superseded in favour of separate_wider_position() and separate_wider_delim() because the two functions make the two uses more obvious, the API is more polished, and the handling of problems is better. Superseded functions will not go away, but will only receive critical bug fixes.
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
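The two splitting modes can be sketched on a plain tibble. This is only an illustration: the column name id and its values are made up, and it assumes the SummarizedExperiment method follows the same tidyr semantics as the data frame method.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one character column holding two fields
df <- tibble(id = c("treated_single", "untreated_paired"))

# Regular expression mode: the default `sep` splits on any run of
# non-alphanumeric characters (here, the underscore)
by_regex <- separate(df, id, into = c("condition", "type"))

# Position mode: a numeric `sep` splits after the given character position
by_position <- separate(df, id, into = c("prefix", "rest"), sep = 3)

by_regex
by_position
```
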
## S3 method for class 'SummarizedExperiment'
separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)
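data | A data frame. |
col | <tidy-select> Column to expand. |
into | Names of new variables to create as character vector. Use NA to omit the variable in the output. |
sep | Separator between columns. If character, sep is interpreted as a regular expression. The default value is a regular expression that matches any sequence of non-alphanumeric values. If numeric, sep is interpreted as character positions to split at. Positive values start at 1 at the far-left of the string; negative values start at -1 at the far-right of the string. The length of sep should be one less than into. |
remove | If TRUE, remove input column from output data frame. |
convert | If TRUE, will run type.convert() with as.is = TRUE on new columns. This is useful if the component columns are integer, numeric or logical. NB: this will cause string "NA"s to be converted to NAs. |
extra | If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options: "warn" (the default): emit a warning and drop extra values; "drop": drop any extra values without a warning; "merge": only splits at most length(into) times. |
fill | If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options: "warn" (the default): emit a warning and fill from the right; "right": fill with missing values on the right; "left": fill with missing values on the left. |
... | Additional arguments passed on to methods. |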
tidySummarizedExperiment
unite(), the complement, extract() which uses regular expression capturing groups.
un <- tidySummarizedExperiment::pasilla |> unite("group", c(condition, type))
un |> separate(col=group, into=c("condition", "type"))
slice() lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. It is accompanied by a number of helpers for common use cases:
slice_head() and slice_tail() select the first or last rows.
slice_sample() randomly selects rows.
slice_min() and slice_max() select rows with the smallest or largest values of a variable.
If .data is a grouped_df, the operation will be performed on each group, so that (e.g.) slice_head(df, n = 5) will select the first five rows in each group.
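The per-group behaviour can be sketched with dplyr on a plain tibble. The data and column names below are hypothetical, and the sketch does not exercise the SummarizedExperiment method itself.

```r
library(dplyr)

# Hypothetical toy data: three rows in group "a", two in group "b"
df <- tibble(group = c("a", "a", "a", "b", "b"), value = 1:5)

# On a grouped data frame, slice_head() keeps the first n rows of each group
first_two <- df %>%
  group_by(group) %>%
  slice_head(n = 2)

first_two
```

Here two rows survive from each group, so the result has four rows rather than two.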
## S3 method for class 'SummarizedExperiment'
slice(.data, ..., .preserve = FALSE)
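.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | For slice(): <data-masking> Integer row values. Provide either positive values to keep, or negative values to drop. The values provided must be either all positive or all negative. Indices beyond the number of rows in the input are silently ignored. For slice_*(), these arguments are passed on to methods. |
.preserve | Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is. |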
Slice does not work with relational databases because they have no intrinsic notion of row order. If you want to perform the equivalent operation, use filter() and row_number().
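As a rough illustration of that equivalence, the sketch below compares the two forms on an in-memory tibble (hypothetical data). On a real database backend an explicit ordering would also be needed, which is not shown here.

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(id = letters[1:5], value = c(10, 20, 30, 40, 50))

# Positional indexing with slice()
by_slice <- df %>% slice(2:3)

# The same rows selected with filter() and row_number()
by_filter <- df %>% filter(row_number() %in% 2:3)

by_slice
by_filter
```
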
An object of the same type as .data. The output has the following properties:
Each row may appear 0, 1, or many times in the output.
Columns are not modified.
Groups are not modified.
Data frame attributes are preserved.
These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
Methods available in currently loaded packages:
slice(): no methods found.
slice_head(): no methods found.
slice_tail(): no methods found.
slice_min(): no methods found.
slice_max(): no methods found.
slice_sample(): no methods found.
Other single table verbs: mutate(), rename(), summarise()
data(pasilla)
pasilla |> slice(1)
summarise() creates a new data frame. It returns one row for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
summarise() and summarize() are synonyms.
## S3 method for class 'SummarizedExperiment'
summarise(.data, ...)

## S3 method for class 'SummarizedExperiment'
summarize(.data, ...)
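.data | A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details. |
... | <data-masking> Name-value pairs of summary functions. The name will be the name of the variable in the result. The value can be: a vector of length 1, e.g. min(x), n(), or sum(is.na(y)); or a data frame, to add multiple columns from a single expression. Returning values with size 0 or >1 was deprecated as of 1.1.0. Please use reframe() for this instead. |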
An object usually of the same type as .data.
The rows come from the underlying group_keys().
The columns are a combination of the grouping keys and the summary expressions that you provide.
The grouping structure is controlled by the .groups= argument, the output may be another grouped_df, a tibble or a rowwise data frame.
Data frame attributes are not preserved, because summarise() fundamentally creates a new data frame.
Count: n(), n_distinct()
The data frame backend supports creating a variable and using it in the same summary. This means that previously created summary variables can be further transformed or combined within the summary, as in mutate(). However, it also means that summary variables with the same names as previous variables overwrite them, making those variables unavailable to later summary variables.
This behaviour may not be supported in other backends. To avoid unexpected results, consider using new names for your summary variables, especially when creating multiple summaries.
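The overwriting pitfall can be sketched with dplyr on a plain tibble. The column names are hypothetical, and the behaviour shown is that of the data frame backend only.

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(x = c(1, 2, 3, 4))

# Reusing the name `x`: after the first expression, `x` is the length-1 mean,
# so sd(x) is computed on a single value and yields NA
overwritten <- df %>% summarise(x = mean(x), sd_x = sd(x))

# Using a new name keeps the original column available to later expressions
distinct_names <- df %>% summarise(mean_x = mean(x), sd_x = sd(x))

overwritten
distinct_names
```
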
This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Other single table verbs: mutate(), rename(), slice()
data(pasilla)
pasilla |> summarise(mean(counts))
For easier customization, the formatting of a tibble is split into three components: header, body, and footer. The tbl_format_header() method is responsible for formatting the header of a tibble.
Override this method if you need to change the appearance of the entire header.
If you only need to change or extend the components shown in the header, override or extend tbl_sum() for your class which is called by the default method.
## S3 method for class 'tidySummarizedExperiment'
tbl_format_header(x, setup, ...)
x | A tibble-like object. |
setup | A setup object returned from tbl_format_setup(). |
... | These dots are for future extensions and must be empty. |
A character vector.
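The lighter-weight route mentioned above, overriding tbl_sum() instead of the whole header, might look like the following sketch. The class name my_tbl is hypothetical and is not part of tidySummarizedExperiment; the method is defined at top level so that S3 dispatch can find it.

```r
library(tibble)
library(pillar)

# Hypothetical tibble subclass "my_tbl", created only for this illustration
x <- new_tibble(list(a = 1:3), nrow = 3L, class = "my_tbl")

# tbl_sum() returns a named character vector; the default header
# method formats these name-value pairs
tbl_sum.my_tbl <- function(x, ...) {
  c("My table" = paste(nrow(x), "rows"))
}

print(x)
```
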
# TODO
tidy for SummarizedExperiment
tidy(object)

## S3 method for class 'SummarizedExperiment'
tidy(object)

## S3 method for class 'RangedSummarizedExperiment'
tidy(object)
object | A SummarizedExperiment object |
A tidySummarizedExperiment object.
data(pasilla)
pasilla %>% tidy()
Convenience function to paste together multiple columns into one.
## S3 method for class 'SummarizedExperiment'
unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
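data | A data frame. |
col | The name of the new column, as a string or symbol. This argument is passed by expression and supports quasiquotation (you can unquote strings and symbols). The name is captured from the expression with rlang::ensym(). |
... | <tidy-select> Columns to unite. |
sep | Separator to use between values. |
remove | If TRUE, remove input columns from the output data frame. |
na.rm | If TRUE, missing values will be removed prior to uniting each value. |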
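The effect of na.rm can be sketched on a plain tibble with tidyr. The column names and values are hypothetical, and the sketch assumes the SummarizedExperiment method passes this argument through unchanged.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data with one missing value
df <- tibble(a = c("x", NA), b = c("y", "z"))

# Default: the missing value is pasted in as the string "NA"
keep_na <- unite(df, "ab", a, b)

# na.rm = TRUE: missing values are dropped before pasting
drop_na <- unite(df, "ab", a, b, na.rm = TRUE)

keep_na
drop_na
```
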
tidySummarizedExperiment
separate(), the complement.
tidySummarizedExperiment::pasilla |> unite("group", c(condition, type))
Unnest expands a list-column containing data frames into rows and columns.
## S3 method for class 'tidySummarizedExperiment_nested'
unnest(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop,
  .id,
  .sep,
  .preserve
)

unnest_summarized_experiment(
  data,
  cols,
  ...,
  keep_empty = FALSE,
  ptype = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  .drop,
  .id,
  .sep,
  .preserve
)
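data | A data frame. |
cols | <tidy-select> List-columns to unnest. When selecting multiple columns, values from the same row will be recycled to their common size. |
... | Deprecated: previously you could write df %>% unnest(x, y, z). Convert to df %>% unnest(c(x, y, z)). |
keep_empty | By default, you get one row of output for each element of the list that you are unchopping/unnesting. This means that if there's a size-0 element (like NULL or an empty data frame or vector), then that entire row will be dropped from the output. If you want to preserve all rows, use keep_empty = TRUE to replace size-0 elements with a single row of missing values. |
ptype | Optionally, a named list of column name-prototype pairs to coerce cols to, overriding the default that will be guessed from combining the individual values. |
names_sep | If NULL, the default, the outer names will come from the inner names. If a string, the outer names will be formed by pasting together the outer and the inner column names, separated by names_sep. |
names_repair | Used to check that output data frame has valid names. Must be one of the following options: "minimal": no name repair or checks, beyond basic existence; "unique": make sure names are unique and not empty; "check_unique" (the default): no name repair, but check they are unique; "universal": make the names unique and syntactic; or a function that applies custom name repair. See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them. |
.drop, .preserve | Deprecated: all list-columns are now preserved; if there are any that you don't want in the output, use select() to remove them prior to unnesting. |
.id | Deprecated: convert df %>% unnest(x, .id = "id") to df %>% mutate(id = names(x)) %>% unnest(x). |
.sep | Deprecated: use names_sep instead. |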
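The keep_empty behaviour can be sketched on a plain tibble with tidyr. The data are hypothetical, and the sketch does not cover the tidySummarizedExperiment_nested method itself.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: the second list element is size 0 (NULL)
df <- tibble(id = 1:2, data = list(tibble(v = 1:2), NULL))

# Default: the row whose list element is empty is dropped
dropped <- unnest(df, data)

# keep_empty = TRUE: the empty element becomes one row of missing values
kept <- unnest(df, data, keep_empty = TRUE)

dropped
kept
```
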
tidySummarizedExperiment
tidyr 1.0.0 introduced a new syntax for nest() and unnest() that's designed to be more similar to other functions. Converting to the new syntax should be straightforward (guided by the message you'll receive) but if you just need to run an old analysis, you can easily revert to the previous behaviour using nest_legacy() and unnest_legacy() as follows:
library(tidyr)
nest <- nest_legacy
unnest <- unnest_legacy
Other rectangling: hoist(), unnest_longer(), unnest_wider()
tidySummarizedExperiment::pasilla |> nest(data=-condition) |> unnest(data)
tidySummarizedExperiment::pasilla |> nest(data=-condition) |> unnest_summarized_experiment(data)