Title: | Preprocessing, analyzing, and reporting of RNA-seq data |
---|---|
Description: | Provides classes and functions for quality control, filtering, normalization and differential expression analysis of pre-processed `RNA-seq` data. Data can be imported from `SummarizedExperiment` as well as `matrix` objects and can be annotated from `BioMart`. Filtering for genes without too low expression or containing required annotations, as well as filtering for samples with sufficient correlation to other samples or total number of reads is supported. The standard normalization methods including cpm, rpkm and tpm can be used, and 'DESeq2` as well as voom differential expression analyses are available. |
Authors: | Daniel Sabanés Bové [aut, cre], Namrata Bhatia [aut], Stefanie Bienert [aut], Benoit Falquet [aut], Haocheng Li [aut], Jeff Luong [aut], Lyndsee Midori Zhang [aut], Alex Richardson [aut], Simona Rossomanno [aut], Tim Treis [aut], Mark Yan [aut], Naomi Chang [aut], Chendi Liao [aut], Carolyn Zhang [aut], Joseph N. Paulson [aut], F. Hoffmann-La Roche AG [cph, fnd] |
Maintainer: | Daniel Sabanés Bové <[email protected]> |
License: | Apache License 2.0 |
Version: | 1.11.0 |
Built: | 2024-12-29 06:55:10 UTC |
Source: | https://github.com/bioc/hermes |
hermes
Packagehermes
facilitates preprocessing, analyzing, and reporting of RNA-seq data.
Maintainer: Daniel Sabanés Bové [email protected]
Authors:
Namrata Bhatia
Stefanie Bienert
Benoit Falquet
Haocheng Li
Jeff Luong
Lyndsee Midori Zhang [email protected]
Alex Richardson
Simona Rossomanno
Tim Treis
Mark Yan
Naomi Chang
Chendi Liao [email protected]
Carolyn Zhang
Joseph N. Paulson [email protected]
Other contributors:
F. Hoffmann-La Roche AG [copyright holder, funder]
Useful links:
Report bugs at https://github.com/insightsengineering/hermes/issues
The function add_quality_flags()
adds quality flag information to a AnyHermesData
object:
low_expression_flag
: for each gene, counts how many samples don't pass a minimum
expression Counts per Million (CPM
) threshold. If too many, then it flags this gene
as a "low expression" gene.
tech_failure_flag
: first calculates the Pearson correlation matrix of the sample wise
CPM
values, resulting in a matrix measuring the correlation between samples.
Then compares the average correlation per sample with a threshold - if it is too low,
then the sample is flagged as a "technical failure" sample.
low_depth_flag
: computes the library size (total number of counts) per sample.
If this number is too low, the sample is flagged as a "low depth" sample.
Separate helper functions are internally used to create the flags, and
separate getter
functions allow easy access to the quality control flags in an object.
add_quality_flags(object, control = control_quality(), overwrite = FALSE) h_low_expression_flag(object, control = control_quality()) h_low_depth_flag(object, control = control_quality()) h_tech_failure_flag(object, control = control_quality()) get_tech_failure(object) get_low_depth(object) get_low_expression(object)
add_quality_flags(object, control = control_quality(), overwrite = FALSE) h_low_expression_flag(object, control = control_quality()) h_low_depth_flag(object, control = control_quality()) h_tech_failure_flag(object, control = control_quality()) get_tech_failure(object) get_low_depth(object) get_low_expression(object)
object |
( |
control |
( |
overwrite |
( |
While object
already has the variables mentioned above as part of the
rowData
and colData
(as this is enforced by the validation
method for AnyHermesData
), they are usually still NA
after the initial
object creation.
The input object with added quality flags.
h_low_expression_flag()
: creates the low expression flag for genes
given control settings.
h_low_depth_flag()
: creates the low depth (library size) flag for samples
given control settings.
h_tech_failure_flag()
: creates the technical failure flag for samples
given control settings.
get_tech_failure()
: get the technical failure flags for all samples.
get_low_depth()
: get the low depth failure flags for all samples.
get_low_expression()
: get the low expression failure flags for all genes.
control_quality()
for the detailed settings specifications;
set_tech_failure()
to manually flag samples as technical failures.
# Adding default quality flags to `AnyHermesData` object. object <- hermes_data result <- add_quality_flags(object) which(get_tech_failure(result) != get_tech_failure(object)) head(get_low_expression(result)) head(get_tech_failure(result)) head(get_low_depth(result)) # It is possible to overwrite flags if needed, which will trigger a message. result2 <- add_quality_flags(result, control_quality(min_cpm = 1000), overwrite = TRUE) # Separate calculation of low expression flag. low_expr_flag <- h_low_expression_flag( object, control_quality(min_cpm = 500, min_cpm_prop = 0.9) ) length(low_expr_flag) == nrow(object) head(low_expr_flag) # Separate calculation of low depth flag. low_depth_flag <- h_low_depth_flag(object, control_quality(min_depth = 5)) length(low_depth_flag) == ncol(object) head(low_depth_flag) # Separate calculation of technical failure flag. tech_failure_flag <- h_tech_failure_flag(object, control_quality(min_corr = 0.35)) length(tech_failure_flag) == ncol(object) head(tech_failure_flag) head(get_tech_failure(object)) head(get_low_depth(object)) head(get_low_expression(object))
# Adding default quality flags to `AnyHermesData` object. object <- hermes_data result <- add_quality_flags(object) which(get_tech_failure(result) != get_tech_failure(object)) head(get_low_expression(result)) head(get_tech_failure(result)) head(get_low_depth(result)) # It is possible to overwrite flags if needed, which will trigger a message. result2 <- add_quality_flags(result, control_quality(min_cpm = 1000), overwrite = TRUE) # Separate calculation of low expression flag. low_expr_flag <- h_low_expression_flag( object, control_quality(min_cpm = 500, min_cpm_prop = 0.9) ) length(low_expr_flag) == nrow(object) head(low_expr_flag) # Separate calculation of low depth flag. low_depth_flag <- h_low_depth_flag(object, control_quality(min_depth = 5)) length(low_depth_flag) == ncol(object) head(low_depth_flag) # Separate calculation of technical failure flag. tech_failure_flag <- h_tech_failure_flag(object, control_quality(min_corr = 0.35)) length(tech_failure_flag) == ncol(object) head(tech_failure_flag) head(get_tech_failure(object)) head(get_low_depth(object)) head(get_low_expression(object))
Internal function to check whether a whole vector is NA
.
all_na(x)
all_na(x)
x |
( |
Corresponding flag.
These methods access and set the gene annotations stored in a AnyHermesData
object.
## S4 method for signature 'AnyHermesData' annotation(object, ...) .row_data_annotation_cols ## S4 replacement method for signature 'AnyHermesData,DataFrame' annotation(object) <- value
## S4 method for signature 'AnyHermesData' annotation(object, ...) .row_data_annotation_cols ## S4 replacement method for signature 'AnyHermesData,DataFrame' annotation(object) <- value
object |
( |
... |
not used. |
value |
( |
The annotation column names are available in the exported
character vector .row_data_annotation_cols
.
The S4Vectors::DataFrame
with the gene annotations:
symbol
desc
chromosome
size
When trying to replace the required annotations with completely missing
values for any genes, a warning will be given and the corresponding gene
IDs will be saved in the attribute annotation.missing.genes
. Note also
that additional annotations beyond the required ones may be supplied and
will be stored.
object <- hermes_data head(annotation(object))
object <- hermes_data head(annotation(object))
The documentation to this function lists all the conventional arguments in
additional checkmate
assertions.
x |
an object to check. |
null.ok |
( |
.var.name |
( |
add |
( |
info |
( |
label |
( |
assert_that
We provide additional assertion functions which can be used together with
assertthat::assert_that()
.
We provide additional assertion functions which can be used together with
the checkmate
functions. These are described in individual help pages
linked below.
is_class(x, class2) is_hermes_data(x) is_counts_vector(x) is_list_with(x, elements) one_provided(one, two) is_constant(x)
is_class(x, class2) is_hermes_data(x) is_counts_vector(x) is_list_with(x, elements) one_provided(one, two) is_constant(x)
x |
an object to check. |
class2 |
( |
elements |
( |
one |
first input. |
two |
second input. |
Depending on the function prefix.
assert_
functions return the object invisibly if successful, and otherwise
throw an error message.
check_
functions return TRUE
if successful, otherwise a string with the
error message.
test_
functions just return TRUE
or FALSE
.
is_class()
: checks the class.
is_hermes_data()
: checks whether x
is an AnyHermesData
object.
is_counts_vector()
: checks for a vector of counts (positive integers).
is_list_with()
: checks for a list containing elements.
one_provided()
: checks that exactly one of the two inputs one
, two
is not NULL
.
is_constant()
: checks whether the vector x
is constant (only supports numeric
, factor
,
character
, logical
). NA
s are removed first.
# Assert a general class. a <- 5 is_class(a, "character") # Assert a `AnyHermesData` object. is_hermes_data(hermes_data) is_hermes_data(42) # Assert a counts vector. a <- 5L is_counts_vector(a) # Assert a list containing certain elements. b <- list(a = 5, b = 3) is_list_with(b, c("a", "c")) is_list_with(b, c("a", "b")) # Assert that exactly one of two arguments is provided. a <- 10 b <- 10 one_provided(a, b) one_provided(a, NULL) # Assert a constant vector. is_constant(c(1, 2)) is_constant(c(NA, 1)) is_constant(c("a", "a")) is_constant(factor(c("a", "a")))
# Assert a general class. a <- 5 is_class(a, "character") # Assert a `AnyHermesData` object. is_hermes_data(hermes_data) is_hermes_data(42) # Assert a counts vector. a <- 5L is_counts_vector(a) # Assert a list containing certain elements. b <- list(a = 5, b = 3) is_list_with(b, c("a", "c")) is_list_with(b, c("a", "b")) # Assert that exactly one of two arguments is provided. a <- 10 b <- 10 one_provided(a, b) one_provided(a, NULL) # Assert a constant vector. is_constant(c(1, 2)) is_constant(c(NA, 1)) is_constant(c("a", "a")) is_constant(factor(c("a", "a")))
This generates all standard plots - histogram and q-q plot of library sizes, density plot of the (log) counts distributions, boxplot of the number of number of non-zero expressed genes per sample, and a stacked barplot of low expression genes by chromosome at default setting.
## S4 method for signature 'AnyHermesData' autoplot(object)
## S4 method for signature 'AnyHermesData' autoplot(object)
object |
( |
A list with the ggplot
objects from draw_libsize_hist()
, draw_libsize_qq()
,
draw_libsize_densities()
, draw_nonzero_boxplot()
and draw_genes_barplot()
functions with default settings.
result <- hermes_data autoplot(result)
result <- hermes_data autoplot(result)
The calc_pca()
function performs principal components analysis of the gene count
vectors across all samples.
A corresponding autoplot()
method then can visualize the results.
calc_pca(object, assay_name = "counts", n_top = NULL)
calc_pca(object, assay_name = "counts", n_top = NULL)
object |
( |
assay_name |
( |
n_top |
( |
PCA should be performed after filtering out low quality genes and samples, as well as normalization of counts.
In addition, genes with constant counts across all samples are excluded from
the analysis internally in calc_pca()
. Centering and scaling is also applied internally.
Plots can be obtained with the ggplot2::autoplot()
function
with the corresponding method from the ggfortify
package to plot the
results of a principal components analysis saved in a HermesDataPca
object. See ggfortify::autoplot.prcomp()
for details.
A HermesDataPca object which is an extension of the stats::prcomp class.
Afterwards correlations between principal components
and sample variables can be calculated, see pca_cor_samplevar
.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() result <- calc_pca(object, assay_name = "tpm") summary(result) result1 <- calc_pca(object, assay_name = "tpm", n_top = 500) summary(result1) # Plot the results. autoplot(result) autoplot(result, x = 2, y = 3) autoplot(result, variance_percentage = FALSE) autoplot(result, label = TRUE, label.repel = TRUE)
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() result <- calc_pca(object, assay_name = "tpm") summary(result) result1 <- calc_pca(object, assay_name = "tpm", n_top = 500) summary(result1) # Plot the results. autoplot(result) autoplot(result, x = 2, y = 3) autoplot(result, variance_percentage = FALSE) autoplot(result, label = TRUE, label.repel = TRUE)
This function concatenates inputs like cat()
and prints them with newline.
cat_with_newline(...)
cat_with_newline(...)
... |
inputs to concatenate. |
None, only used for the side effect of producing the concatenated output in the R console.
This is similar to cli::cat_line()
.
cat_with_newline("hello", "world")
cat_with_newline("hello", "world")
AnyHermesData
ObjectsThis method combines AnyHermesData
objects with the same ranges but different
samples (columns in assays).
... |
( |
The combined AnyHermesData
object.
Note that this just inherits
SummarizedExperiment::cbind,SummarizedExperiment-method()
. When binding a
AnyHermesData
object with a SummarizedExperiment::SummarizedExperiment
object, then the result will be a
SummarizedExperiment::SummarizedExperiment
object (the more general
class).
Note that the combined object needs to have unique sample IDs (column names).
rbind
to row bind objects.
a <- hermes_data[, 1:10] b <- hermes_data[, 11:20] result <- cbind(a, b) class(result)
a <- hermes_data[, 1:10] b <- hermes_data[, 11:20] result <- cbind(a, b) class(result)
Check whether x
is a (single) proportion.
check_proportion(x, null.ok = FALSE) assert_proportion( x, null.ok = FALSE, .var.name = checkmate::vname(x), add = NULL ) test_proportion(x, null.ok = FALSE) expect_proportion(x, null.ok = FALSE, info = NULL, label = vname(x))
check_proportion(x, null.ok = FALSE) assert_proportion( x, null.ok = FALSE, .var.name = checkmate::vname(x), add = NULL ) test_proportion(x, null.ok = FALSE) expect_proportion(x, null.ok = FALSE, info = NULL, label = vname(x))
x |
an object to check. |
null.ok |
( |
.var.name |
( |
add |
( |
info |
( |
label |
( |
TRUE
if successful, otherwise a string with the error message.
assertions
for more details.
check_proportion(0.25)
check_proportion(0.25)
This obtains the sample variables of a HermesData
object together
with selected gene information.
col_data_with_genes(object, assay_name, genes)
col_data_with_genes(object, assay_name, genes)
object |
( |
assay_name |
( |
genes |
( |
The combined data set, where the additional attribute gene_cols
contains
the names of the columns obtained by extracting the genes
information.
The class of the returned data set will depend on the class of colData
, so usually
will be S4Vectors::DFrame
.
result <- col_data_with_genes(hermes_data, "counts", gene_spec("GeneID:1820")) tail(names(result)) result$GeneID.1820
result <- col_data_with_genes(hermes_data, "counts", gene_spec("GeneID:1820")) tail(names(result)) result$GeneID.1820
This helper function returns the Z-score from an assay stored as a matrix
.
colMeanZscores(x)
colMeanZscores(x)
x |
( |
A numeric vector containing the mean Z-score values for each
column in x
.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() %>% assay("counts") colMeanZscores(object)
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() %>% assay("counts") colMeanZscores(object)
This helper function returns the first principal component from an assay
stored as a matrix
.
colPrinComp1(x, center = TRUE, scale = TRUE)
colPrinComp1(x, center = TRUE, scale = TRUE)
x |
( |
center |
( |
scale |
( |
A numeric vector containing the principal component values for each
column in x
.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() %>% assay("counts") colPrinComp1(object)
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() %>% assay("counts") colPrinComp1(object)
BioMart
connect_biomart()
creates a connection object of class ConnectionBiomart
which contains
the biomaRt
object of class biomaRt::Mart
and the prefix of the object
which is used downstream for the query.
connect_biomart(prefix = c("ENSG", "GeneID"), version = NULL)
connect_biomart(prefix = c("ENSG", "GeneID"), version = NULL)
prefix |
( |
version |
( |
This connects to the Ensembl
data base of BioMart
for human genes.
A specific version can be optionally chosen to ensure reproducibility of results
once a new release is available, as accessed data might then change.
ConnectionBiomart
object.
if (interactive()) { connection <- connect_biomart("ENSG") }
if (interactive()) { connection <- connect_biomart("ENSG") }
This control function allows for easy customization of the normalization settings.
control_normalize( log = TRUE, lib_sizes = NULL, prior_count = 1, fit_type = "parametric" )
control_normalize( log = TRUE, lib_sizes = NULL, prior_count = 1, fit_type = "parametric" )
log |
( |
lib_sizes |
( |
prior_count |
(non-negative |
fit_type |
( |
List with the above settings used to perform the normalization procedure.
To be used with the normalize()
function.
control_normalize() control_normalize(log = FALSE, lib_sizes = rep(1e6L, 20))
control_normalize() control_normalize(log = FALSE, lib_sizes = rep(1e6L, 20))
Control function which specifies the quality flag settings. One or more settings can be customized. Not specified settings are left at defaults.
control_quality( min_cpm = 1, min_cpm_prop = 0.25, min_corr = 0.5, min_depth = NULL )
control_quality( min_cpm = 1, min_cpm_prop = 0.25, min_corr = 0.5, min_depth = NULL )
min_cpm |
(non-negative |
min_cpm_prop |
( |
min_corr |
( |
min_depth |
(non-negative |
List with the above criteria to flag observations.
To be used with the add_quality_flags()
function.
# Default settings. control_quality() # One or more settings can be customized. control_quality(min_cpm = 5, min_cpm_prop = 0.001)
# Default settings. control_quality() # One or more settings can be customized. control_quality(min_cpm = 5, min_cpm_prop = 0.001)
New generic function to calculate correlations for one or two objects.
correlate(object, ...)
correlate(object, ...)
object |
input of which the class will be used to decide the method. |
... |
additional arguments. |
Corresponding object that contains the correlation results.
pca_cor_samplevar and calc_cor which are the methods included for this generic function.
sample_cors <- correlate(hermes_data) autoplot(sample_cors) pca_sample_var_cors <- correlate(calc_pca(hermes_data), hermes_data) autoplot(pca_sample_var_cors)
sample_cors <- correlate(hermes_data) autoplot(sample_cors) pca_sample_var_cors <- correlate(calc_pca(hermes_data), hermes_data) autoplot(pca_sample_var_cors)
AnyHermesData
The correlate()
method can calculate the correlation matrix between the sample vectors of
counts from a specified assay. This produces a HermesDataCor
object, which is an extension
of a matrix
with additional quality flags in the slot flag_data
(containing the tech_failure_flag
and low_depth_flag
columns describing the original
input samples).
An autoplot()
method then afterwards can produce the corresponding heatmap.
## S4 method for signature 'AnyHermesData' correlate(object, assay_name = "counts", method = "pearson", ...) ## S4 method for signature 'HermesDataCor' autoplot( object, flag_colors = c(`FALSE` = "green", `TRUE` = "red"), cor_colors = circlize::colorRamp2(c(0, 0.5, 1), c("red", "yellow", "green")), ... )
## S4 method for signature 'AnyHermesData' correlate(object, assay_name = "counts", method = "pearson", ...) ## S4 method for signature 'HermesDataCor' autoplot( object, flag_colors = c(`FALSE` = "green", `TRUE` = "red"), cor_colors = circlize::colorRamp2(c(0, 0.5, 1), c("red", "yellow", "green")), ... )
object |
( |
assay_name |
( |
method |
( |
... |
other arguments to be passed to |
flag_colors |
(named |
cor_colors |
( |
A HermesDataCor
object.
autoplot(HermesDataCor)
: This autoplot()
method uses the ComplexHeatmap::Heatmap()
function
to plot the correlations between samples saved in a HermesDataCor
object.
object <- hermes_data # Calculate the sample correlation matrix. correlate(object) # We can specify another correlation coefficient to be calculated. result <- correlate(object, method = "spearman") # Plot the correlation matrix. autoplot(result) # We can customize the heatmap. autoplot(result, show_column_names = FALSE, show_row_names = FALSE) # Including changing the axis label text size. autoplot( result, row_names_gp = grid::gpar(fontsize = 8), column_names_gp = grid::gpar(fontsize = 8) )
object <- hermes_data # Calculate the sample correlation matrix. correlate(object) # We can specify another correlation coefficient to be calculated. result <- correlate(object, method = "spearman") # Plot the correlation matrix. autoplot(result) # We can customize the heatmap. autoplot(result, show_column_names = FALSE, show_row_names = FALSE) # Including changing the axis label text size. autoplot( result, row_names_gp = grid::gpar(fontsize = 8), column_names_gp = grid::gpar(fontsize = 8) )
This correlate()
method analyses the correlations (in R2 values) between all sample variables
in a AnyHermesData
object and the principal components of the samples.
A corresponding autoplot()
method then can visualize the results in a heatmap.
## S4 method for signature 'HermesDataPca' correlate(object, data) ## S4 method for signature 'HermesDataPcaCor' autoplot( object, cor_colors = circlize::colorRamp2(c(-1, 0, 1), c("blue", "white", "red")), ... )
## S4 method for signature 'HermesDataPca' correlate(object, data) ## S4 method for signature 'HermesDataPcaCor' autoplot( object, cor_colors = circlize::colorRamp2(c(-1, 0, 1), c("blue", "white", "red")), ... )
object |
( |
data |
( |
cor_colors |
( |
... |
other arguments to be passed to |
A HermesDataPcaCor
object with R2 values for all sample variables.
autoplot(HermesDataPcaCor)
: This plot method uses the ComplexHeatmap::Heatmap()
function
to visualize a HermesDataPcaCor
object.
h_pca_df_r2_matrix()
which is used internally for the details.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Perform PCA and then correlate the prinicipal components with the sample variables. object_pca <- calc_pca(object) result <- correlate(object_pca, object) # Visualize the correlations in a heatmap. autoplot(result) # We can also choose to not reorder the columns. autoplot(result, cluster_columns = FALSE) # We can also choose break-points for color customization. autoplot( result, cor_colors = circlize::colorRamp2( c(-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1), c("blue", "green", "purple", "yellow", "orange", "red", "brown") ) )
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Perform PCA and then correlate the prinicipal components with the sample variables. object_pca <- calc_pca(object) result <- correlate(object_pca, object) # Visualize the correlations in a heatmap. autoplot(result) # We can also choose to not reorder the columns. autoplot(result, cluster_columns = FALSE) # We can also choose break-points for color customization. autoplot( result, cor_colors = circlize::colorRamp2( c(-0.5, -0.25, 0, 0.25, 0.5, 0.75, 1), c("blue", "green", "purple", "yellow", "orange", "red", "brown") ) )
These methods access and set the counts assay in a AnyHermesData
object.
## S4 method for signature 'AnyHermesData' counts(object, ...) ## S4 replacement method for signature 'AnyHermesData,matrix' counts(object, ..., withDimnames = TRUE) <- value
## S4 method for signature 'AnyHermesData' counts(object, ...) ## S4 replacement method for signature 'AnyHermesData,matrix' counts(object, ..., withDimnames = TRUE) <- value
object |
( |
... |
not used. |
withDimnames |
( |
value |
( |
The counts assay.
counts(object = AnyHermesData) <- value
:
a <- hermes_data result <- counts(a) class(result) head(result) counts(a) <- counts(a) + 100L head(counts(a))
a <- hermes_data result <- counts(a) class(result) head(result) counts(a) <- counts(a) + 100L head(counts(a))
This function transforms a numeric vector into a factor corresponding to the quantile intervals. The intervals are left-open and right-closed.
cut_quantile(x, percentiles = c(1/3, 2/3), digits = 0)
cut_quantile(x, percentiles = c(1/3, 2/3), digits = 0)
x |
( |
percentiles |
( |
digits |
( |
The factor with a description of the available quantiles as levels.
set.seed(452) x <- runif(10, -10, 10) cut_quantile(x, c(0.33333333, 0.6666666), digits = 4) x[1:4] <- NA cut_quantile(x)
set.seed(452) x <- runif(10, -10, 10) cut_quantile(x, c(0.33333333, 0.6666666), digits = 4) x[1:4] <- NA cut_quantile(x)
DataFrame
This utility function converts all eligible character and logical variables in a
S4Vectors::DataFrame
to factor variables. All factor variables get amended
with an explicit missing level. This includes both NA
and empty strings.
df_cols_to_factor(data, omit_columns = NULL, na_level = "<Missing>")
df_cols_to_factor(data, omit_columns = NULL, na_level = "<Missing>")
data |
( |
omit_columns |
( |
na_level |
( |
The modified data.
All required rowData
and colData
variables cannot be converted
to ensure proper downstream behavior. These are automatically omitted if found in data
and therefore do not need to be specified in omit_columns
.
dat <- colData(summarized_experiment) any(vapply(dat, is.character, logical(1))) any(vapply(dat, is.logical, logical(1))) dat_converted <- df_cols_to_factor(dat) any(vapply(dat_converted, function(x) is.character(x) || is.logical(x), logical(1)))
dat <- colData(summarized_experiment) any(vapply(dat, is.character, logical(1))) any(vapply(dat, is.logical, logical(1))) dat_converted <- df_cols_to_factor(dat) any(vapply(dat_converted, function(x) is.character(x) || is.logical(x), logical(1)))
The diff_expression()
function performs differential expression analysis
using a method of preference.
A corresponding autoplot()
method is visualizing the results as a volcano plot.
diff_expression(object, group, method = c("voom", "deseq2"), ...) ## S4 method for signature 'HermesDataDiffExpr' autoplot(object, adj_p_val_thresh = 0.05, log2_fc_thresh = 2.5)
diff_expression(object, group, method = c("voom", "deseq2"), ...) ## S4 method for signature 'HermesDataDiffExpr' autoplot(object, adj_p_val_thresh = 0.05, log2_fc_thresh = 2.5)
object |
( |
group |
( |
method |
( |
... |
additional arguments passed to the helper function associated with the selected method. |
adj_p_val_thresh |
( |
log2_fc_thresh |
( |
Possible method choices are:
voom: uses limma::voom()
, see h_diff_expr_voom()
for details.
deseq2
: uses DESeq2::DESeq()
, see h_diff_expr_deseq2()
for details.
A HermesDataDiffExpr
object which is a data frame with the following columns for each gene
in the HermesData
object:
log2_fc
(the estimate of the log2 fold change between the 2 levels of the
provided factor)
stat
(the test statistic, which one depends on the method used)
p_val
(the raw p-value)
adj_p_val
(the multiplicity adjusted p-value value)
autoplot(HermesDataDiffExpr)
: generates a volcano plot for a HermesDataDiffExpr
object.
We provide the df_cols_to_factor()
utility function that makes it easy to convert the
colData()
character and logical variables to factors, so that they can be subsequently
used as group
inputs. See the example.
In order to avoid a warning when using deseq2
, it can be necessary to specify
fitType = "local"
as additional argument. This could e.g. be the case when only few samples
are present in which case the default parametric dispersions estimation will not work.
object <- hermes_data %>% add_quality_flags() %>% filter() # Convert character and logical to factor variables in `colData`, # including the below used `group` variable. colData(object) <- df_cols_to_factor(colData(object)) res1 <- diff_expression(object, group = "SEX", method = "voom") head(res1) res2 <- diff_expression(object, group = "SEX", method = "deseq2") head(res2) # Pass method arguments to the internally used helper functions. res3 <- diff_expression(object, group = "SEX", method = "voom", robust = TRUE, trend = TRUE) head(res3) res4 <- diff_expression(object, group = "SEX", method = "deseq2", fitType = "local") head(res4) # Create the corresponding volcano plots. autoplot(res1) autoplot(res3)
object <- hermes_data %>% add_quality_flags() %>% filter() # Convert character and logical to factor variables in `colData`, # including the below used `group` variable. colData(object) <- df_cols_to_factor(colData(object)) res1 <- diff_expression(object, group = "SEX", method = "voom") head(res1) res2 <- diff_expression(object, group = "SEX", method = "deseq2") head(res2) # Pass method arguments to the internally used helper functions. res3 <- diff_expression(object, group = "SEX", method = "voom", robust = TRUE, trend = TRUE) head(res3) res4 <- diff_expression(object, group = "SEX", method = "deseq2", fitType = "local") head(res4) # Create the corresponding volcano plots. autoplot(res1) autoplot(res3)
This produces a barplot of the dichotomized gene expression counts into two or three categories based on custom defined percentiles.
draw_barplot( object, assay_name, x_spec, facet_var = NULL, fill_var = NULL, percentiles = c(1/3, 2/3) )
draw_barplot( object, assay_name, x_spec, facet_var = NULL, fill_var = NULL, percentiles = c(1/3, 2/3) )
object |
( |
assay_name |
( |
x_spec |
( |
facet_var |
( |
fill_var |
( |
percentiles |
( |
The ggplot
barplot.
object <- hermes_data g <- genes(object) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1]), facet_var = "SEX", fill_var = "AGE18" ) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1:3], colMedians, "Median"), facet_var = "SEX", fill_var = "AGE18" ) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1:3], colMeans, "Mean"), facet_var = "SEX", fill_var = "AGE18", percentiles = c(0.1, 0.9) )
object <- hermes_data g <- genes(object) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1]), facet_var = "SEX", fill_var = "AGE18" ) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1:3], colMedians, "Median"), facet_var = "SEX", fill_var = "AGE18" ) draw_barplot( object, assay_name = "counts", x_spec = gene_spec(g[1:3], colMeans, "Mean"), facet_var = "SEX", fill_var = "AGE18", percentiles = c(0.1, 0.9) )
This produces boxplots of the gene expression values of a single gene, multiple genes or a gene signature.
draw_boxplot( object, assay_name, genes, x_var = NULL, color_var = NULL, facet_var = NULL, violin = FALSE, jitter = FALSE ) h_draw_boxplot_df(object, assay_name, genes, x_var, color_var, facet_var)
draw_boxplot( object, assay_name, genes, x_var = NULL, color_var = NULL, facet_var = NULL, violin = FALSE, jitter = FALSE ) h_draw_boxplot_df(object, assay_name, genes, x_var, color_var, facet_var)
object |
( |
assay_name |
( |
genes |
( |
x_var |
( |
color_var |
( |
facet_var |
( |
violin |
( |
jitter |
( |
The ggplot
boxplot.
h_draw_boxplot_df()
: Helper function to prepare the data frame required
for plotting.
object <- hermes_data draw_boxplot( object, assay_name = "counts", genes = gene_spec(c(A = genes(object)[1])), violin = TRUE ) object2 <- object %>% add_quality_flags() %>% filter() %>% normalize() draw_boxplot( object2, assay_name = "tpm", x_var = "SEX", genes = gene_spec(setNames(genes(object2)[1:10], 1:10), fun = colMeans), facet_var = "RACE", color_var = "AGE18", jitter = TRUE ) draw_boxplot( object, assay_name = "counts", x_var = "SEX", genes = gene_spec(genes(object)[1:3]), jitter = TRUE, facet_var = "AGE18" ) draw_boxplot( object, assay_name = "counts", genes = gene_spec(c(A = "GeneID:11185", B = "GeneID:10677")), violin = TRUE )
object <- hermes_data draw_boxplot( object, assay_name = "counts", genes = gene_spec(c(A = genes(object)[1])), violin = TRUE ) object2 <- object %>% add_quality_flags() %>% filter() %>% normalize() draw_boxplot( object2, assay_name = "tpm", x_var = "SEX", genes = gene_spec(setNames(genes(object2)[1:10], 1:10), fun = colMeans), facet_var = "RACE", color_var = "AGE18", jitter = TRUE ) draw_boxplot( object, assay_name = "counts", x_var = "SEX", genes = gene_spec(genes(object)[1:3]), jitter = TRUE, facet_var = "AGE18" ) draw_boxplot( object, assay_name = "counts", genes = gene_spec(c(A = "GeneID:11185", B = "GeneID:10677")), violin = TRUE )
This creates a barplot of chromosomes for the AnyHermesData object with the proportions of low expression genes.
draw_genes_barplot( object, chromosomes = c(seq_len(22), "X", "Y", "MT"), include_others = TRUE )
draw_genes_barplot( object, chromosomes = c(seq_len(22), "X", "Y", "MT"), include_others = TRUE )
object |
( |
chromosomes |
( |
include_others |
( |
The ggplot
object with the histogram.
object <- hermes_data # Display chromosomes 1-22, X, Y, and MT. Other chromosomes are displayed in "Others". # To increase readability, we can have flip the coordinate axes. draw_genes_barplot(object) + coord_flip() # Alternatively we can also rotate the x-axis tick labels. draw_genes_barplot(object) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # Display chromosomes 1 and 2. Other chromosomes are displayed in "Others". draw_genes_barplot(object, chromosomes = c("1", "2")) # Display chromosomes 1 and 2 only. draw_genes_barplot(object, chromosomes = c("1", "2"), include_others = FALSE)
object <- hermes_data # Display chromosomes 1-22, X, Y, and MT. Other chromosomes are displayed in "Others". # To increase readability, we can have flip the coordinate axes. draw_genes_barplot(object) + coord_flip() # Alternatively we can also rotate the x-axis tick labels. draw_genes_barplot(object) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # Display chromosomes 1 and 2. Other chromosomes are displayed in "Others". draw_genes_barplot(object, chromosomes = c("1", "2")) # Display chromosomes 1 and 2 only. draw_genes_barplot(object, chromosomes = c("1", "2"), include_others = FALSE)
This produces a heatmap of the chosen assay and groups by various sample variables.
draw_heatmap( object, assay_name, color_extremes = c(0.01, 0.99), col_data_annotation = NULL, ... )
draw_heatmap( object, assay_name, color_extremes = c(0.01, 0.99), col_data_annotation = NULL, ... )
object |
( |
assay_name |
( |
color_extremes |
( |
col_data_annotation |
( |
... |
additional arguments to pass to |
The ComplexHeatmap::Heatmap
heatmap
result <- hermes_data %>% normalize(methods = "voom") %>% add_quality_flags() %>% filter(what = "genes") draw_heatmap( object = result[1:10, ], assay_name = "counts", col_data_annotation = "COUNTRY" ) draw_heatmap( object = result[1:10, ], assay_name = "counts", color_extremes = c(0.001, 0.999), col_data_annotation = "AGEGRP" )
result <- hermes_data %>% normalize(methods = "voom") %>% add_quality_flags() %>% filter(what = "genes") draw_heatmap( object = result[1:10, ], assay_name = "counts", col_data_annotation = "COUNTRY" ) draw_heatmap( object = result[1:10, ], assay_name = "counts", color_extremes = c(0.001, 0.999), col_data_annotation = "AGEGRP" )
This creates a density plot of the (log) counts distributions of the AnyHermesData object where each line on the plot corresponds to a sample.
draw_libsize_densities(object, log = TRUE)
draw_libsize_densities(object, log = TRUE)
object |
( |
log |
( |
The ggplot
object with the density plot.
result <- hermes_data draw_libsize_densities(result) draw_libsize_densities(result, log = FALSE)
result <- hermes_data draw_libsize_densities(result) draw_libsize_densities(result, log = FALSE)
This creates a histogram of the library sizes of the AnyHermesData object.
draw_libsize_hist(object, bins = 30L, fill = "darkgrey")
draw_libsize_hist(object, bins = 30L, fill = "darkgrey")
object |
( |
bins |
( |
fill |
( |
The ggplot
object with the histogram.
result <- hermes_data draw_libsize_hist(result) draw_libsize_hist(result, bins = 10L, fill = "blue")
result <- hermes_data draw_libsize_hist(result) draw_libsize_hist(result, bins = 10L, fill = "blue")
This creates a Q-Q plot of the library sizes of the AnyHermesData object.
draw_libsize_qq(object, color = "grey", linetype = "dashed")
draw_libsize_qq(object, color = "grey", linetype = "dashed")
object |
( |
color |
( |
linetype |
( |
The ggplot
object with the Q-Q Plot.
result <- hermes_data draw_libsize_qq(result) draw_libsize_qq(result, color = "blue", linetype = "solid") # We can also add sample names as labels. library(ggrepel) draw_libsize_qq(result) + geom_text_repel(label = colnames(result), stat = "qq")
result <- hermes_data draw_libsize_qq(result) draw_libsize_qq(result, color = "blue", linetype = "solid") # We can also add sample names as labels. library(ggrepel) draw_libsize_qq(result) + geom_text_repel(label = colnames(result), stat = "qq")
This draws a boxplot, with overlaid data points, of the number of non-zero expressed genes per sample.
draw_nonzero_boxplot(object, position = position_jitter(0.2), alpha = 0.25)
draw_nonzero_boxplot(object, position = position_jitter(0.2), alpha = 0.25)
object |
( |
position |
( |
alpha |
( |
The ggplot
object with the boxplot.
# Default boxplot. result <- hermes_data draw_nonzero_boxplot(result) # Reusing the same position for labeling. library(ggrepel) pos <- position_jitter(0.5) draw_nonzero_boxplot(result, position = pos) + geom_text_repel(aes(label = samples(result)), position = pos)
# Default boxplot. result <- hermes_data draw_nonzero_boxplot(result) # Reusing the same position for labeling. library(ggrepel) pos <- position_jitter(0.5) draw_nonzero_boxplot(result, position = pos) + geom_text_repel(aes(label = samples(result)), position = pos)
This produces a scatterplot of two genes or gene signatures.
draw_scatterplot( object, assay_name, x_spec, y_spec, color_var = NULL, facet_var = NULL, smooth_method = c("lm", "loess", "none") )
draw_scatterplot( object, assay_name, x_spec, y_spec, color_var = NULL, facet_var = NULL, smooth_method = c("lm", "loess", "none") )
object |
( |
assay_name |
( |
x_spec |
( |
y_spec |
( |
color_var |
( |
facet_var |
( |
smooth_method |
( |
The ggplot
scatterplot.
object <- hermes_data g <- genes(object) draw_scatterplot( object, assay_name = "counts", facet_var = NULL, x_spec = gene_spec(c(A = g[1])), y_spec = gene_spec(g[2]), color = "RACE" ) object2 <- object %>% add_quality_flags() %>% filter() %>% normalize() g2 <- genes(object2) draw_scatterplot( object2, assay_name = "tpm", facet_var = "SEX", x_spec = gene_spec(g2[1:10], colMeans, "Mean"), y_spec = gene_spec(g2[11:20], colMedians, "Median"), smooth_method = "loess" )
object <- hermes_data g <- genes(object) draw_scatterplot( object, assay_name = "counts", facet_var = NULL, x_spec = gene_spec(c(A = g[1])), y_spec = gene_spec(g[2]), color = "RACE" ) object2 <- object %>% add_quality_flags() %>% filter() %>% normalize() g2 <- genes(object2) draw_scatterplot( object2, assay_name = "tpm", facet_var = "SEX", x_spec = gene_spec(g2[1:10], colMeans, "Mean"), y_spec = gene_spec(g2[11:20], colMedians, "Median"), smooth_method = "loess" )
ExpressionSet
DataThis example data can be used to try out conversion of a Biobase::ExpressionSet
object into a HermesData
object.
expression_set
expression_set
A Biobase::ExpressionSet
object with 20 samples covering 5085
features (Entrez
gene IDs).
This is an artificial dataset designed to resemble real data.
SummarizedExperiment::makeSummarizedExperimentFromExpressionSet()
to convert into a
SummarizedExperiment::SummarizedExperiment
.
summarized_experiment
which contains similar data already as a
SummarizedExperiment::SummarizedExperiment
.
The methods access the names of the variables in colData()
and rowData()
of
the object which are not required by design. So these can be additional sample or
patient characteristics, or gene characteristics.
extraColDataNames(x, ...) ## S4 method for signature 'AnyHermesData' extraColDataNames(x, ...) extraRowDataNames(x, ...) ## S4 method for signature 'AnyHermesData' extraRowDataNames(x, ...)
extraColDataNames(x, ...) ## S4 method for signature 'AnyHermesData' extraColDataNames(x, ...) extraRowDataNames(x, ...) ## S4 method for signature 'AnyHermesData' extraRowDataNames(x, ...)
x |
( |
... |
not used. |
The character vector with the additional variable names in either
colData()
or rowData()
.
object <- hermes_data extraColDataNames(object) extraRowDataNames(object)
object <- hermes_data extraColDataNames(object) extraRowDataNames(object)
AnyHermesData
on Subset Passing Default QC FlagsThis filters a AnyHermesData
object using the default QC flags and required annotations.
filter(object, ...) ## S4 method for signature 'AnyHermesData' filter(object, what = c("genes", "samples"), annotation_required = "size") ## S4 method for signature 'data.frame' filter(object, ...) ## S4 method for signature 'ts' filter(object, ...)
filter(object, ...) ## S4 method for signature 'AnyHermesData' filter(object, what = c("genes", "samples"), annotation_required = "size") ## S4 method for signature 'data.frame' filter(object, ...) ## S4 method for signature 'ts' filter(object, ...)
object |
( |
... |
additional arguments. |
what |
( |
annotation_required |
( |
Only genes without low expression (low_expression_flag
) and samples
without low depth (low_depth_flag
) or technical failure (tech_failure_flag
)
remain in the returned filtered object.
Also required gene annotation columns can be specified, so that genes which are not complete
for these columns are filtered out. By default this is the size
column, which is needed
for default normalization of the object.
The filtered AnyHermesData
object.
The internal implementation cannot use the subset()
method since that
requires non-standard evaluation of arguments.
a <- hermes_data dim(a) # Filter genes and samples on default QC flags. result <- filter(a) dim(result) # Filter only genes without low expression. result <- filter(a, what = "genes") # Filter only samples with low depth and technical failure. result <- filter(a, what = "samples") # Filter only genes, and require certain annotations to be present. result <- filter(a, what = "genes", annotation_required = c("size"))
a <- hermes_data dim(a) # Filter genes and samples on default QC flags. result <- filter(a) dim(result) # Filter only genes without low expression. result <- filter(a, what = "genes") # Filter only samples with low depth and technical failure. result <- filter(a, what = "samples") # Filter only genes, and require certain annotations to be present. result <- filter(a, what = "genes", annotation_required = c("size"))
GeneSpec
ConstructorCreates a new GeneSpec
object.
gene_spec(genes = NULL, fun = NULL, fun_name = deparse(substitute(fun)))
gene_spec(genes = NULL, fun = NULL, fun_name = deparse(substitute(fun)))
genes |
(named |
fun |
( |
fun_name |
( |
A new GeneSpec
object.
gene_spec("GeneID:11185") gene_spec(c("GeneID:11185", "GeneID:10677", "GeneID:101928428"), fun = colMeans)
gene_spec("GeneID:11185") gene_spec(c("GeneID:11185", "GeneID:10677", "GeneID:101928428"), fun = colMeans)
Access the gene IDs, i.e. row names, of a AnyHermesData
object with a
nicely named accessor method.
genes(object) ## S4 method for signature 'AnyHermesData' genes(object)
genes(object) ## S4 method for signature 'AnyHermesData' genes(object)
object |
( |
The character vector with the gene IDs.
samples()
to access the sample IDs.
a <- hermes_data genes(a)
a <- hermes_data genes(a)
A GeneSpec
consists of the gene IDs (possibly named with labels),
the summary function and the name of the summary function.
new()
Creates a new GeneSpec
object.
GeneSpec$new(genes = NULL, fun = NULL, fun_name = deparse(substitute(fun)))
genes
(named character
or NULL
)
the gene IDs, where the names
are used as labels if available.
fun
(function
or NULL
)
summary function. If NULL
is
used then multiple genes are not summarized but returned as a matrix from the
extract
method.
fun_name
(string
)
name of the summary function.
A new GeneSpec
object.
get_genes()
Returns the genes.
GeneSpec$get_genes()
get_gene_labels()
Returns the gene labels (substituted by gene IDs if not available).
GeneSpec$get_gene_labels(genes = self$get_genes())
genes
(character
)
for which subset of genes the labels should be returned.
returns_vector()
Predicate whether the extract returns a vector or not.
GeneSpec$returns_vector()
get_label()
Returns a string which can be used e.g. for plot labels.
GeneSpec$get_label(genes = self$get_genes())
genes
(character
)
for which subset of genes the labels should be returned.
extract()
Extract the gene values from an assay as specified.
GeneSpec$extract(assay)
assay
(matrix
)
original matrix with rownames containing the
specified genes.
Either a vector with one value per column, or a matrix with multiple genes in the rows.
extract_data_frame()
Extract the gene values as a data.frame
.
GeneSpec$extract_data_frame(assay)
assay
(matrix
)
original matrix with rownames containing the
specified genes.
A data.frame
with the genes in the columns and the samples
in the rows.
clone()
The objects of this class are cloneable with this method.
GeneSpec$clone(deep = FALSE)
deep
Whether to make a deep clone.
# Minimal specification if only one gene is used. x_spec <- gene_spec("GeneID:1820") # Using multiple genes with a signature. x_spec <- gene_spec(c("GeneID:1820", "GeneID:52"), fun = colMeans) x_spec <- gene_spec(c("GeneID:1820", "GeneID:52"), fun = colPrinComp1) x_spec$returns_vector() x_spec$get_genes() x_spec$get_gene_labels() x_spec$get_label() # Using multiple genes with partial labels, without a signature. x_spec <- gene_spec(c(A = "GeneID:1820", "GeneID:52")) x_spec$returns_vector() x_spec$get_gene_labels() # Use the gene specification to extract genes from a matrix. mat <- matrix( data = rpois(15, 10), nrow = 3, ncol = 5, dimnames = list(c("GeneID:1820", "GeneID:52", "GeneID:523"), NULL) ) x_spec$extract(mat) # We can also extract these as a `data.frame`. x_spec$extract_data_frame(mat)
# Minimal specification if only one gene is used. x_spec <- gene_spec("GeneID:1820") # Using multiple genes with a signature. x_spec <- gene_spec(c("GeneID:1820", "GeneID:52"), fun = colMeans) x_spec <- gene_spec(c("GeneID:1820", "GeneID:52"), fun = colPrinComp1) x_spec$returns_vector() x_spec$get_genes() x_spec$get_gene_labels() x_spec$get_label() # Using multiple genes with partial labels, without a signature. x_spec <- gene_spec(c(A = "GeneID:1820", "GeneID:52")) x_spec$returns_vector() x_spec$get_gene_labels() # Use the gene specification to extract genes from a matrix. mat <- matrix( data = rpois(15, 10), nrow = 3, ncol = 5, dimnames = list(c("GeneID:1820", "GeneID:52", "GeneID:523"), NULL) ) x_spec$extract(mat) # We can also extract these as a `data.frame`. x_spec$extract_data_frame(mat)
The difference here to duplicated()
is that also the first occurrence
of a duplicate is flagged as TRUE
.
h_all_duplicated(x)
h_all_duplicated(x)
x |
a vector or a data frame or an array or |
Logical vector flagging all occurrences of duplicate values as TRUE
.
h_all_duplicated(c("a", "a", "b")) duplicated(c("a", "a", "b"))
h_all_duplicated(c("a", "a", "b")) duplicated(c("a", "a", "b"))
data.frame
This helper function converts all character and logical variables
to factor variables in a data.frame
. It also sets an explicit missing data level
for all factor variables that have at least one NA
. Empty strings are handled
as NA
.
h_df_factors_with_explicit_na(data, na_level = "<Missing>")
h_df_factors_with_explicit_na(data, na_level = "<Missing>")
data |
( |
na_level |
( |
The modified data.
dat <- data.frame( a = c(NA, 2), b = c("A", NA), c = c("C", "D"), d = factor(c(NA, "X")), e = factor(c("Y", "Z")) ) h_df_factors_with_explicit_na(dat)
dat <- data.frame( a = c(NA, 2), b = c("A", NA), c = c("C", "D"), d = factor(c(NA, "X")), e = factor(c("Y", "Z")) ) h_df_factors_with_explicit_na(dat)
DESeq2
Differential Expression AnalysisThis helper functions performs the differential expression analysis with
DESeq2::DESeq()
for a given AnyHermesData input and design
matrix.
h_diff_expr_deseq2(object, design, ...)
h_diff_expr_deseq2(object, design, ...)
object |
( |
design |
( |
... |
additional arguments internally passed to |
A data frame with columns log2_fc
(estimated log2 fold change),
stat
(Wald statistic), p_val
(raw p-value), adj_p_pval
(Benjamini-Hochberg adjusted p-value).
Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15(12), 550. doi:10.1186/s13059-014-0550-8.
object <- hermes_data # Create the design matrix corresponding to the factor of interest. design <- model.matrix(~SEX, colData(object)) # Then perform the `DESeq2` differential expression analysis. result <- h_diff_expr_deseq2(object, design) head(result) # Change of the `fitType` can be required in some cases. result2 <- h_diff_expr_deseq2(object, design, fitType = "local") head(result2)
object <- hermes_data # Create the design matrix corresponding to the factor of interest. design <- model.matrix(~SEX, colData(object)) # Then perform the `DESeq2` differential expression analysis. result <- h_diff_expr_deseq2(object, design) head(result) # Change of the `fitType` can be required in some cases. result2 <- h_diff_expr_deseq2(object, design, fitType = "local") head(result2)
limma
/voom Differential Expression AnalysisThis helper functions performs the differential expression analysis with the voom
method from the limma
package (via limma::voom()
, limma::lmFit()
and limma::eBayes()
)
for given counts in a AnyHermesData object and a corresponding design
matrix.
h_diff_expr_voom(object, design, ...)
h_diff_expr_voom(object, design, ...)
object |
( |
design |
( |
... |
additional arguments internally passed to |
A data frame with columns log2_fc
(estimated log2 fold change),
stat
(moderated t-statistic), p_val
(raw p-value), adj_p_pval
(Benjamini-Hochberg adjusted p-value).
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi:10.1093/nar/gkv007.
Law CW, Chen Y, Shi W, Smyth GK (2014). “voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology, 15(2), R29. doi:10.1186/gb-2014-15-2-r29.
object <- hermes_data # Create the design matrix corresponding to the factor of interest. design <- model.matrix(~SEX, colData(object)) # Then perform the differential expression analysis. result <- h_diff_expr_voom(object, design) head(result) # Sometimes we might want to specify method details. result2 <- h_diff_expr_voom(object, design, trend = TRUE, robust = TRUE) head(result2)
object <- hermes_data # Create the design matrix corresponding to the factor of interest. design <- model.matrix(~SEX, colData(object)) # Then perform the differential expression analysis. result <- h_diff_expr_voom(object, design) head(result) # Sometimes we might want to specify method details. result2 <- h_diff_expr_voom(object, design, trend = TRUE, robust = TRUE) head(result2)
Ensembl
to Entrez
Gene IDsThis helper function queries BioMart
to translate Ensembl
to Entrez
Gene IDs.
h_ensembl_to_entrez_ids(gene_ids, mart)
h_ensembl_to_entrez_ids(gene_ids, mart)
gene_ids |
( |
mart |
( |
Character vector of Entrez
gene IDs.
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_ensembl_to_entrez_ids(c("ENSG00000135407", "ENSG00000241644"), mart) }
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_ensembl_to_entrez_ids(c("ENSG00000135407", "ENSG00000241644"), mart) }
BioMart
Helper function to query annotations from biomaRt
, for cleaned up gene IDs of
a specific ID variable and given biomaRt::Mart
.
h_get_annotation_biomart(gene_ids, id_var, mart)
h_get_annotation_biomart(gene_ids, id_var, mart)
gene_ids |
( |
id_var |
( |
mart |
( |
A data frame with columns:
id_var
(depending on what was used)
hgnc_symbol
entrezgene_description
chromosome_name
size
refseq_mrna
refseq_peptide
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_get_annotation_biomart(c("11185", "10677"), id_var = "entrezgene_id", mart = mart) }
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_get_annotation_biomart(c("11185", "10677"), id_var = "entrezgene_id", mart = mart) }
BioMart
Coordinates into GRanges
This function extracts the chromosome number, the start position and the end position of transcripts
in given data.frame
with coordinates as returned by biomaRt::getBM()
and converts
them to a GRanges
object.
h_get_granges_by_id(coords, id)
h_get_granges_by_id(coords, id)
coords |
( |
id |
( |
GRange
objects for the respective single gene ID.
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") attrs <- c( "ensembl_gene_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end" ) coords <- biomaRt::getBM( filters = "entrezgene_id", attributes = attrs, values = c("11185", "10677"), mart = mart ) h_get_granges_by_id(coords, "ENSG00000135407") }
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") attrs <- c( "ensembl_gene_id", "ensembl_exon_id", "chromosome_name", "exon_chrom_start", "exon_chrom_end" ) coords <- biomaRt::getBM( filters = "entrezgene_id", attributes = attrs, values = c("11185", "10677"), mart = mart ) h_get_granges_by_id(coords, "ENSG00000135407") }
This helper function queries BioMart
for lengths of genes by adding up all
exon lengths after reducing overlaps.
h_get_size_biomart(gene_ids, id_var, mart)
h_get_size_biomart(gene_ids, id_var, mart)
gene_ids |
( |
id_var |
( |
mart |
( |
Named integer vector indicating the gene lengths.
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_get_size_biomart("11185", "entrezgene_id", mart) h_get_size_biomart("ENSG00000215417", "ensembl_gene_id", mart) h_get_size_biomart(c("11185", "10677"), "entrezgene_id", mart) h_get_size_biomart(c("ENSG00000135407", "ENSG00000215417"), "ensembl_gene_id", mart) }
if (interactive()) { mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl") h_get_size_biomart("11185", "entrezgene_id", mart) h_get_size_biomart("ENSG00000215417", "ensembl_gene_id", mart) h_get_size_biomart(c("11185", "10677"), "entrezgene_id", mart) h_get_size_biomart(c("ENSG00000135407", "ENSG00000215417"), "ensembl_gene_id", mart) }
This helper function determines for each gene in the object whether all required annotation columns are filled.
h_has_req_annotations(object, annotation_required)
h_has_req_annotations(object, annotation_required)
object |
( |
annotation_required |
( |
Named logical vector with one value for each gene in object
, which is TRUE
if all
required annotation columns are filled, and otherwise FALSE
.
filter()
where this is used internally.
object <- hermes_data result <- h_has_req_annotations(object, "size") all(result) rowData(object)$size[1] <- NA # nolint which(!h_has_req_annotations(object, "size"))
object <- hermes_data result <- h_has_req_annotations(object, "size") all(result) rowData(object)$size[1] <- NA # nolint which(!h_has_req_annotations(object, "size"))
This is used by the rename
method. It wraps the assertions and the
matching used several times.
h_map_pos(names, map)
h_map_pos(names, map)
names |
( |
map |
(named |
Integer vector of the positions of the map
values in the names
.
h_map_pos(c("a", "b"), c(d = "b"))
h_map_pos(c("a", "b"), c(d = "b"))
This helper function adds parentheses around each element of a character vector.
h_parens(x)
h_parens(x)
x |
( |
Character vector with parentheses, except when x
is a blank string
in which case it is returned unaltered.
h_parens("bla") h_parens("") h_parens(c("bla", "bli"))
h_parens("bla") h_parens("") h_parens(c("bla", "bli"))
This function processes sample variables from AnyHermesData
and the
corresponding principal components matrix, and then generates the matrix of R2 values.
h_pca_df_r2_matrix(pca, df)
h_pca_df_r2_matrix(pca, df)
pca |
( |
df |
( |
Note that only the df
columns which are numeric
, character
, factor
or
logical
are included in the resulting matrix, because other variable types are not
supported.
In addition, df
columns which are constant, all NA
, or character
or factor
columns with too many levels are also dropped before the analysis.
A matrix with R2 values for all combinations of sample variables and principal components.
h_pca_var_rsquared()
which is used internally to calculate the R2 for one
sample variable.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Obtain the principal components. pca <- calc_pca(object)$x # Obtain the `colData` as a `data.frame`. df <- as.data.frame(colData(object)) # Correlate them. r2_all <- h_pca_df_r2_matrix(pca, df) str(r2_all) # We can see that only about half of the columns from `df` were # used for the correlations. ncol(r2_all) ncol(df)
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Obtain the principal components. pca <- calc_pca(object)$x # Obtain the `colData` as a `data.frame`. df <- as.data.frame(colData(object)) # Correlate them. r2_all <- h_pca_df_r2_matrix(pca, df) str(r2_all) # We can see that only about half of the columns from `df` were # used for the correlations. ncol(r2_all) ncol(df)
This helper function calculates R2 values between one sample variable from AnyHermesData
and all Principal Components (PCs) separately (one linear model is fit for each PC).
h_pca_var_rsquared(pca, x)
h_pca_var_rsquared(pca, x)
pca |
( |
x |
( |
Note that in case there are estimation problems for any of the PCs, then NA
will
be returned for those.
A vector with R2 values for each principal component.
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Obtain the principal components. pca <- calc_pca(object)$x # Obtain the sample variable. x <- colData(object)$AGE18 # Correlate them. r2 <- h_pca_var_rsquared(pca, x)
object <- hermes_data %>% add_quality_flags() %>% filter() %>% normalize() # Obtain the principal components. pca <- calc_pca(object)$x # Obtain the sample variable. x <- colData(object)$AGE18 # Correlate them. r2 <- h_pca_var_rsquared(pca, x)
This helper function makes a short list string, e.g. "a, b, ..., z"
out of a character vector, e.g. letters
.
h_short_list(x, sep = ", ", thresh = 3L)
h_short_list(x, sep = ", ", thresh = 3L)
x |
( |
sep |
( |
thresh |
( |
String with the short list.
h_short_list(letters) h_short_list(letters[1:3]) h_short_list(LETTERS[1:5], sep = ";", thresh = 5L)
h_short_list(letters) h_short_list(letters[1:3]) h_short_list(LETTERS[1:5], sep = ";", thresh = 5L)
This helper function removes the prefix and possible delimiter from a vector of gene IDs, such that only the digits are returned.
h_strip_prefix(gene_ids, prefix)
h_strip_prefix(gene_ids, prefix)
gene_ids |
( |
prefix |
( |
Character vector that contains only the digits for each gene ID.
This is currently used to strip away the GeneID
prefix from Entrez
gene IDs
so that they can be queried from BioMart
h_strip_prefix(c("GeneID:11185", "GeneID:10677"), prefix = "GeneID")
h_strip_prefix(c("GeneID:11185", "GeneID:10677"), prefix = "GeneID")
This helper function generates a set of unique labels given unique IDs and not necessarily unique names.
h_unique_labels(ids, nms = NULL)
h_unique_labels(ids, nms = NULL)
ids |
( |
nms |
( |
Character vector where empty names are replaced by the IDs and non-unique names are made unique by appending the IDs in parentheses.
h_unique_labels(c("1", "2", "3"), c("A", "B", "A")) h_unique_labels(NULL) h_unique_labels(c("1", "2", "3"))
h_unique_labels(c("1", "2", "3"), c("A", "B", "A")) h_unique_labels(NULL) h_unique_labels(c("1", "2", "3"))
HermesData
DataThis example HermesData
is created from the underlying SummarizedExperiment::SummarizedExperiment
object by renaming descriptors to align with standard specification. It already
contains the required columns in rowData
and colData
.
hermes_data
hermes_data
A HermesData object with 20 samples covering
5085 features (Entrez
gene IDs).
This is an artificial dataset designed to resemble real data.
summarized_experiment
for the underlying SummarizedExperiment::SummarizedExperiment
object.
HermesData
and RangedHermesData
The HermesData
class is an extension of SummarizedExperiment::SummarizedExperiment
with additional validation criteria.
HermesData(object) HermesDataFromMatrix(counts, ...)
HermesData(object) HermesDataFromMatrix(counts, ...)
object |
( |
counts |
( |
... |
additional arguments, e.g. |
The additional criteria are:
The first assay must be counts
containing non-missing, integer, non-negative values.
Note that rename()
can be used to edit the assay name to counts
if needed.
The following columns must be in rowData
:
symbol
(also often called HGNC
or similar, example: "INMT"
)
desc
(the gene name, example: "indolethylamine N-methyltransferase"
)
chromosome
(the chromosome as string, example: "7"
)
size
(the size of the gene in base pairs, e.g 5468
)
low_expression_flag
(can be populated with add_quality_flags()
)
The following columns must be in colData
:
low_depth_flag
(can be populated with add_quality_flags()
)
tech_failure_flag
(can be populated with add_quality_flags()
)
The object must have unique row and column names. The row names are the gene names and the column names are the sample names.
Analogously, RangedHermesData
is an extension of
SummarizedExperiment::RangedSummarizedExperiment
and has the same
additional validation requirements. Methods can be defined for both classes at the
same time with the AnyHermesData
signature.
A Biobase::ExpressionSet
object can be imported by using the
SummarizedExperiment::makeSummarizedExperimentFromExpressionSet()
function to
first convert it to a SummarizedExperiment::SummarizedExperiment
object before
converting it again into a HermesData
object.
An object of class AnyHermesData
(HermesData
or RangedHermesData
).
prefix
common prefix of the gene IDs (row names).
Note that we use S4Vectors::setValidity2()
to define the validity
method, which allows us to turn off the validity checks in internal
functions where intermediate objects may not be valid within the scope of
the function.
It can be helpful to convert character and logical variables to factors in colData()
(before or after the HermesData
creation). We provide the utility function
df_cols_to_factor()
to simplify this task, but leave it to the user to allow
for full control of the details.
rename()
for renaming columns of the input data.
# Convert an `ExpressionSet` to a `RangedSummarizedExperiment`. ranged_summarized_experiment <- makeSummarizedExperimentFromExpressionSet(expression_set) # Then convert to `RangedHermesData`. HermesData(ranged_summarized_experiment) # Create objects starting from a `SummarizedExperiment`. hermes_data <- HermesData(summarized_experiment) hermes_data # Create objects from a matrix. Note that additional arguments are not required but possible. counts_matrix <- assay(summarized_experiment) counts_hermes_data <- HermesDataFromMatrix(counts_matrix)
# Convert an `ExpressionSet` to a `RangedSummarizedExperiment`. ranged_summarized_experiment <- makeSummarizedExperimentFromExpressionSet(expression_set) # Then convert to `RangedHermesData`. HermesData(ranged_summarized_experiment) # Create objects starting from a `SummarizedExperiment`. hermes_data <- HermesData(summarized_experiment) hermes_data # Create objects from a matrix. Note that additional arguments are not required but possible. counts_matrix <- assay(summarized_experiment) counts_hermes_data <- HermesDataFromMatrix(counts_matrix)
This is a useful function when trying to join genetic with CDISC data sets.
inner_join_cdisc( gene_data, cdisc_data, patient_key = "USUBJID", additional_keys = character() )
inner_join_cdisc( gene_data, cdisc_data, patient_key = "USUBJID", additional_keys = character() )
gene_data |
( |
cdisc_data |
( |
patient_key |
( |
additional_keys |
( |
A data.frame
which contains columns from both data sets merged by the keys.
Columns which are contained in both data sets but are not specified as keys are taken
from gene_data
and not from cdisc_data
.
gene_data <- col_data_with_genes(hermes_data, "counts", gene_spec("GeneID:1820")) cdisc_data <- data.frame( USUBJID = head(gene_data$USUBJID, 10), extra = 1:10 ) result <- inner_join_cdisc(gene_data, cdisc_data) result
gene_data <- col_data_with_genes(hermes_data, "counts", gene_spec("GeneID:1820")) cdisc_data <- data.frame( USUBJID = head(gene_data$USUBJID, 10), extra = 1:10 ) result <- inner_join_cdisc(gene_data, cdisc_data) result
SummarizedExperiment
This method checks whether a SummarizedExperiment::SummarizedExperiment
object is empty.
## S4 method for signature 'SummarizedExperiment' isEmpty(x)
## S4 method for signature 'SummarizedExperiment' isEmpty(x)
x |
( |
Flag whether the object
is empty.
isEmpty(summarized_experiment) isEmpty(summarized_experiment[NULL, ]) isEmpty(hermes_data)
isEmpty(summarized_experiment) isEmpty(summarized_experiment[NULL, ]) isEmpty(hermes_data)
lapply
method for MultiAssayExperiment
Apply a function on all experiments in an MAE.
## S4 method for signature 'MultiAssayExperiment' lapply(X, FUN, safe = TRUE, ...)
## S4 method for signature 'MultiAssayExperiment' lapply(X, FUN, safe = TRUE, ...)
X |
( |
FUN |
( |
safe |
( |
... |
additional arguments passed to |
MultiAssayExperiment
object with specified function applied.
object <- multi_assay_experiment result <- lapply(object, normalize, safe = TRUE) # Similarly, all experiments in an MAE can be converted to HermesData class: result <- lapply(object, HermesData, safe = TRUE)
object <- multi_assay_experiment result <- lapply(object, normalize, safe = TRUE) # Similarly, all experiments in an MAE can be converted to HermesData class: result <- lapply(object, HermesData, safe = TRUE)
These methods access or set the metadata in a AnyHermesData
object.
x |
( |
value |
( |
The metadata which is a list.
Note that this just inherits S4Vectors::metadata,Annotated-method()
.
a <- hermes_data metadata(a) metadata(a) <- list(new = "my metadata") metadata(a)
a <- hermes_data metadata(a) metadata(a) <- list(new = "my metadata") metadata(a)
MultiAssayExperiment
DataThis example MultiAssayExperiment::MultiAssayExperiment
can be used as test data.
multi_assay_experiment
multi_assay_experiment
A MultiAssayExperiment::MultiAssayExperiment
object with 3 separate HermesData
objects.
The first object contains 5 samples and covers 1000 features (Entrez
gene IDs).
The second object contains 9 samples with 2500 features.
The third object contains 6 samples with 1300 features.
This is an artificial dataset designed to resemble real data.
AnyHermesData
ObjectsThe normalize()
method is normalizing the input AnyHermesData
according to one or more
specified normalization methods. The results are saved as additional assays
in the object.
Possible normalization methods (which are implemented with separate helper functions):
cpm: Counts per Million (CPM
). Separately by sample, the original counts of the genes
are divided by the library size of this sample, and multiplied by one million. This is the
appropriate normalization for between-sample comparisons.
rpkm: Reads per Kilobase of transcript per Million reads mapped (RPKM). Each gene count is divided by the gene size (in kilobases) and then again divided by the library sizes of each sample (in millions). This allows for within-sample comparisons, as it takes into account the gene sizes - longer genes will always have more counts than shorter genes.
tpm: Transcripts per Million (TPM). This addresses the problem of RPKM being inconsistent across samples (which can be seen that the sum of all RPKM values will vary from sample to sample). Therefore here we divide the RPKM by the sum of all RPKM values for each sample, and multiply by one million.
voom: VOOM normalization. This is essentially just a slight variation of CPM
where
a prior_count
of 0.5 is combined with lib_sizes
increased by 1 for each sample. Note that
this is not required for the corresponding differential expression analysis, but just provided
as a complementary experimental normalization approach here.
vst
: Variance stabilizing transformation. This is to transform the normalized
count data for all genes into approximately homoskedastic values (having constant variance).
rlog
: The transformation to the log2 scale values with approximately homoskedastic values.
## S4 method for signature 'AnyHermesData' normalize( object, methods = c("cpm", "rpkm", "tpm", "voom", "vst"), control = control_normalize(), ... ) h_cpm(object, control = control_normalize()) h_rpkm(object, control = control_normalize()) h_tpm(object, control = control_normalize()) h_voom(object, control = control_normalize()) h_vst(object, control = control_normalize()) h_rlog(object, control = control_normalize())
## S4 method for signature 'AnyHermesData' normalize( object, methods = c("cpm", "rpkm", "tpm", "voom", "vst"), control = control_normalize(), ... ) h_cpm(object, control = control_normalize()) h_rpkm(object, control = control_normalize()) h_tpm(object, control = control_normalize()) h_voom(object, control = control_normalize()) h_vst(object, control = control_normalize()) h_rlog(object, control = control_normalize())
object |
( |
methods |
( |
control |
(named |
... |
not used. |
The AnyHermesData
object with additional assays containing the normalized counts.
The control
is saved in the metadata
of the object for future reference.
h_cpm()
: calculates the Counts per Million (CPM
) normalized counts.
h_rpkm()
: calculates the Reads per Kilobase per Million (RPKM) normalized counts.
h_tpm()
: calculates the Transcripts per Million (TPM) normalized counts.
h_vst()
: variance stabilizing transformation (vst
) from DESeq2
package.
h_rlog()
: regularized log transformation (rlog
) from DESeq2
package.
control_normalize()
to define the normalization method settings.
a <- hermes_data # By default, log values are used with a prior count of 1 added to original counts. result <- normalize(a) assayNames(result) tpm <- assay(result, "tpm") tpm[1:3, 1:3] # We can also work on original scale. result_orig <- normalize(a, control = control_normalize(log = FALSE)) tpm_orig <- assay(result_orig, "tpm") tpm_orig[1:3, 1:3] # Separate calculation of the CPM normalized counts. counts_cpm <- h_cpm(a) str(counts_cpm) # Separate calculation of the RPKM normalized counts. counts_rpkm <- h_rpkm(a) str(counts_rpkm) # Separate calculation of the TPM normalized counts. counts_tpm <- h_tpm(a) str(counts_tpm) # Separate calculation of the VOOM normalized counts. counts_voom <- h_voom(a) str(counts_voom) # Separate calculation of the vst transformation. counts_vst <- h_vst(a) str(counts_vst) # Separate calculation of the rlog transformation. counts_rlog <- h_rlog(a) str(counts_rlog)
a <- hermes_data # By default, log values are used with a prior count of 1 added to original counts. result <- normalize(a) assayNames(result) tpm <- assay(result, "tpm") tpm[1:3, 1:3] # We can also work on original scale. result_orig <- normalize(a, control = control_normalize(log = FALSE)) tpm_orig <- assay(result_orig, "tpm") tpm_orig[1:3, 1:3] # Separate calculation of the CPM normalized counts. counts_cpm <- h_cpm(a) str(counts_cpm) # Separate calculation of the RPKM normalized counts. counts_rpkm <- h_rpkm(a) str(counts_rpkm) # Separate calculation of the TPM normalized counts. counts_tpm <- h_tpm(a) str(counts_tpm) # Separate calculation of the VOOM normalized counts. counts_voom <- h_voom(a) str(counts_voom) # Separate calculation of the vst transformation. counts_vst <- h_vst(a) str(counts_vst) # Separate calculation of the rlog transformation. counts_rlog <- h_rlog(a) str(counts_rlog)
Generic function to access the prefix from an object.
prefix(object, ...)
prefix(object, ...)
object |
( |
... |
additional arguments. |
The prefix
slot contents.
a <- hermes_data prefix(a)
a <- hermes_data prefix(a)
The generic function query()
is the interface for querying gene annotations from
a data base connection.
query(genes, connection) ## S4 method for signature 'character,ConnectionBiomart' query(genes, connection)
query(genes, connection) ## S4 method for signature 'character,ConnectionBiomart' query(genes, connection)
genes |
( |
connection |
(connection class) |
A method is provided for the ConnectionBiomart
class. However, the framework
is extensible: It is simple to add new connections and corresponding query methods
for other data bases, e.g. company internal data bases. Please make sure to
follow the required format of the returned value.
The BioMart
queries might not return information for all the genes. This can be
due to different versions being used in the gene IDs and the queried Ensembl
data base.
A S4Vectors::DataFrame
with the gene annotations. It is required that:
The rownames
are identical to the input genes
.
The colnames
are equal to the annotation columns .row_data_annotation_cols
.
Therefore, missing information needs to be properly included in the DataFrame
with NA
entries.
if (interactive()) { object <- hermes_data connection <- connect_biomart(prefix(object)) result <- query(genes(object), connection) head(result) head(annotation(object)) }
if (interactive()) { object <- hermes_data connection <- connect_biomart(prefix(object)) result <- query(genes(object), connection) head(result) head(annotation(object)) }
AnyHermesData
ObjectsThis method combines AnyHermesData
objects with the same samples but different
features of interest (rows in assays).
... |
( |
The combined AnyHermesData
object.
Note that this just inherits
SummarizedExperiment::rbind,SummarizedExperiment-method()
. When binding a
AnyHermesData
object with a SummarizedExperiment::SummarizedExperiment
object, then the result will be a
SummarizedExperiment::SummarizedExperiment
object (the more general
class).
Note that we need to have unique gene IDs (row names) and the same prefix across the combined object.
cbind
to column bind objects.
a <- hermes_data[1:2542, ] b <- hermes_data[2543:5085, ] result <- rbind(a, b) class(result)
a <- hermes_data[1:2542, ] b <- hermes_data[2543:5085, ] result <- rbind(a, b) class(result)
SummarizedExperiment
ObjectsThis method renames columns of the rowData
and colData
, as well as assays, of
SummarizedExperiment::SummarizedExperiment
objects. This increases the flexibility
since renaming can be done before conversion to a HermesData
object.
## S4 method for signature 'SummarizedExperiment' rename( x, row_data = character(), col_data = character(), assays = character(), ... ) ## S4 method for signature 'data.frame' rename(x, ...)
## S4 method for signature 'SummarizedExperiment' rename( x, row_data = character(), col_data = character(), assays = character(), ... ) ## S4 method for signature 'data.frame' rename(x, ...)
x |
( |
row_data |
(named |
col_data |
(named |
assays |
(named |
... |
additional arguments (not used here). |
The SummarizedExperiment::SummarizedExperiment
object with renamed contents.
x <- summarized_experiment # Use deliberately a non-standard assay name in this example. assayNames(x) <- "count" # Rename `HGNC` to `symbol` in the `rowData`. x <- rename(x, row_data = c(symbol = "HGNC")) head(names(rowData(x))) # Rename `LowDepthFlag` to `low_depth_flag` in `colData`. x <- rename(x, col_data = c(low_depth_flag = "LowDepthFlag")) tail(names(colData(x))) # Rename assay `count` to `counts`. x <- rename(x, assays = c(counts = "count")) assayNames(x)
x <- summarized_experiment # Use deliberately a non-standard assay name in this example. assayNames(x) <- "count" # Rename `HGNC` to `symbol` in the `rowData`. x <- rename(x, row_data = c(symbol = "HGNC")) head(names(rowData(x))) # Rename `LowDepthFlag` to `low_depth_flag` in `colData`. x <- rename(x, col_data = c(low_depth_flag = "LowDepthFlag")) tail(names(colData(x))) # Rename assay `count` to `counts`. x <- rename(x, assays = c(counts = "count")) assayNames(x)
Access the sample IDs, i.e. col names, of a AnyHermesData
object with a
nicely named accessor method.
## S4 method for signature 'AnyHermesData' samples(object)
## S4 method for signature 'AnyHermesData' samples(object)
object |
( |
The character vector with the sample IDs.
genes()
to access the gene IDs.
a <- hermes_data samples(a)
a <- hermes_data samples(a)
Setter function which allows the user to define a sample manually as a technical failure.
set_tech_failure(object, sample_ids)
set_tech_failure(object, sample_ids)
object |
( |
sample_ids |
( |
AnyHermesData
object with modified technical failure flags.
add_quality_flags()
which automatically sets all (gene and sample) quality flags,
including these technical failure flags.
# Manually flag technical failures in a `AnyHermesData` object. object <- hermes_data get_tech_failure(object)["06520101B0017R"] result <- set_tech_failure(object, c("06520101B0017R", "06520047C0017R")) get_tech_failure(result)["06520101B0017R"]
# Manually flag technical failures in a `AnyHermesData` object. object <- hermes_data get_tech_failure(object)["06520101B0017R"] result <- set_tech_failure(object, c("06520101B0017R", "06520047C0017R")) get_tech_failure(result)["06520101B0017R"]
AnyHermesData
ObjectsA show method that displays high-level information of AnyHermesData
objects.
## S4 method for signature 'HermesData' show(object) ## S4 method for signature 'RangedHermesData' show(object)
## S4 method for signature 'HermesData' show(object) ## S4 method for signature 'RangedHermesData' show(object)
object |
( |
None (invisible NULL
), only used for the side effect of printing to
the console.
The same method is used for both HermesData
and RangedHermesData
objects. We need to define this separately to have this method used instead of
the one inherited from SummarizedExperiment::SummarizedExperiment
.
object <- hermes_data object
object <- hermes_data object
AnyHermesData
ObjectsThis method subsets AnyHermesData
objects, based on expressions involving the
rowData
columns and the colData
columns.
x |
( |
subset |
( |
select |
( |
The subsetted AnyHermesData
object.
Note that this just inherits
SummarizedExperiment::subset,SummarizedExperiment-method()
.
a <- hermes_data a # Subset both genes and samples. subset(a, subset = low_expression_flag, select = DISCSTUD == "N") # Subset only genes. subset(a, subset = chromosome == "2") # Subset only samples. subset(a, select = AGE > 18)
a <- hermes_data a # Subset both genes and samples. subset(a, subset = low_expression_flag, select = DISCSTUD == "N") # Subset only genes. subset(a, subset = chromosome == "2") # Subset only samples. subset(a, select = AGE > 18)
SummarizedExperiment
DataThis example SummarizedExperiment::SummarizedExperiment
can be used to create a
HermesData
object. It already contains the required columns in rowData
and colData
.
summarized_experiment
summarized_experiment
A SummarizedExperiment::SummarizedExperiment object with 20 samples covering
5085 features (Entrez
gene IDs).
This is an artificial dataset designed to resemble real data.
expression_set
which contains similar data as a Biobase::ExpressionSet
.
AnyHermesData
ObjectsProvides a concise summary of the content of AnyHermesData
objects.
summary(object, ...) ## S4 method for signature 'AnyHermesData' summary(object) ## S4 method for signature 'HermesDataSummary' show(object)
summary(object, ...) ## S4 method for signature 'AnyHermesData' summary(object) ## S4 method for signature 'HermesDataSummary' show(object)
object |
( |
... |
not used. |
An object of the corresponding summary class, here
HermesDataSummary
.
summary(AnyHermesData)
: A summary method for AnyHermesData
object that
creates a HermesDataSummary
object.
show(HermesDataSummary)
: A show method prints summary description of HermesDataSummary
object
generated by the summary()
method.
object <- hermes_data object_summary <- summary(object) # We can access parts of this S4 object with the `slot` function. str(object_summary) slotNames(object_summary) slot(object_summary, "lib_sizes") # Just calling the summary method like this will use the `show()` method. summary(object)
object <- hermes_data object_summary <- summary(object) # We can access parts of this S4 object with the `slot` function. str(object_summary) slotNames(object_summary) slot(object_summary, "lib_sizes") # Just calling the summary method like this will use the `show()` method. summary(object)
top_genes()
creates a HermesDataTopGenes
object, which extends data.frame
. It
contains two columns:
expression
: containing the statistic values calculated by summary_fun
across columns.
name
: the gene names.
The corresponding autoplot()
method then visualizes the result as a barplot.
top_genes( object, assay_name = "counts", summary_fun = rowMeans, n_top = if (is.null(min_threshold)) 10L else NULL, min_threshold = NULL ) ## S4 method for signature 'HermesDataTopGenes' autoplot( object, x_lab = "HGNC gene names", y_lab = paste0(object@summary_fun_name, "(", object@assay_name, ")"), title = "Top most expressed genes" )
top_genes( object, assay_name = "counts", summary_fun = rowMeans, n_top = if (is.null(min_threshold)) 10L else NULL, min_threshold = NULL ) ## S4 method for signature 'HermesDataTopGenes' autoplot( object, x_lab = "HGNC gene names", y_lab = paste0(object@summary_fun_name, "(", object@assay_name, ")"), title = "Top most expressed genes" )
object |
( |
assay_name |
( |
summary_fun |
( |
n_top |
( |
min_threshold |
( |
x_lab |
( |
y_lab |
( |
title |
( |
The data frame is sorted in descending order of expression
and only the top
entries according to the selection criteria are included.
Note that exactly one of the arguments n_top
and min_threshold
must be
provided.
A HermesDataTopGenes object.
autoplot(HermesDataTopGenes)
: Creates a bar plot from a HermesDataTopGenes object,
where the y axis shows the expression statistics for each of the top genes
on the x-axis.
object <- hermes_data # Default uses average of raw counts across samples to rank genes. top_genes(object) # Instead of showing top 10 genes, can also set a minimum threshold on average counts. top_genes(object, n_top = NULL, min_threshold = 50000) # We can also use the maximum of raw counts across samples, by specifying a different # summary statistics function. result <- top_genes(object, summary_fun = rowMax) # Finally we can produce barplots based on the results. autoplot(result, title = "My top genes") autoplot(result, y_lab = "Counts", title = "My top genes")
object <- hermes_data # Default uses average of raw counts across samples to rank genes. top_genes(object) # Instead of showing top 10 genes, can also set a minimum threshold on average counts. top_genes(object, n_top = NULL, min_threshold = 50000) # We can also use the maximum of raw counts across samples, by specifying a different # summary statistics function. result <- top_genes(object, summary_fun = rowMax) # Finally we can produce barplots based on the results. autoplot(result, title = "My top genes") autoplot(result, y_lab = "Counts", title = "My top genes")
AnyHermesData
ObjectsThese functions are used internally only and therefore not exported. They work on
SummarizedExperiment::SummarizedExperiment
objects, and AnyHermesData
objects are
defined by successfully passing these validation checks.
validate_counts(object) validate_cols(required, actual) validate_row_data(object) validate_col_data(object) validate_names(object) validate_prefix(object)
validate_counts(object) validate_cols(required, actual) validate_row_data(object) validate_col_data(object) validate_names(object) validate_prefix(object)
object |
( |
required |
( |
actual |
( |
A character vector with the validation failure messages, or NULL
in
case validation passes.
validate_counts()
: validates that the first assay is counts
containing
non-missing, integer, non-negative values.
validate_cols()
: validates that required column names are contained in
actual column names.
validate_row_data()
: validates that the object contains rowData
with
required columns.
validate_col_data()
: validates that the object contains colData
with
required columns.
validate_names()
: validates that the object contains row and column names.
validate_prefix()
: validates that the object prefix is a string
and only contains alphabetic characters.
This helper function wraps SummarizedExperiment
objects into an
a MultiAssayExperiment
(MAE) object.
wrap_in_mae(x, name = deparse(substitute(x)))
wrap_in_mae(x, name = deparse(substitute(x)))
x |
( |
name |
( |
The MAE object with the only experiment being x
having the given
name
.
mae <- wrap_in_mae(summarized_experiment) mae[["summarized_experiment"]]
mae <- wrap_in_mae(summarized_experiment) mae[["summarized_experiment"]]