Package 'ppcseq' reference manual

Title:	Probabilistic Outlier Identification for RNA Sequencing Generalized Linear Models
Description:	Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods that are based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given in a statistical model. Here we propose ppcseq, a key quality-control tool for identifying transcripts that include outlier data points in differential expression analysis, which do not follow a negative binomial distribution. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.
Authors:	Stefano Mangiola [aut, cre]
Maintainer:	Stefano Mangiola <[email protected]>
License:	GPL-3
Version:	1.15.0
Built:	2025-03-17 06:30:27 UTC
Source:	https://github.com/bioc/ppcseq

The 'ppcseq' package.

Description

Relative transcript abundance has proven to be a valuable tool for understanding the function of genes in biological systems. For the differential analysis of transcript abundance using RNA sequencing data, the negative binomial model is by far the most frequently adopted. However, common methods that are based on a negative binomial model are not robust to extreme outliers, which we found to be abundant in public datasets. So far, no rigorous and probabilistic methods for detection of outliers have been developed for RNA sequencing data, leaving the identification mostly to visual inspection. Recent advances in Bayesian computation allow large-scale comparison of observed data against its theoretical distribution given in a statistical model. Here we propose ppcseq, a key quality-control tool for identifying transcripts that include outlier data points in differential expression analysis, which do not follow a negative binomial distribution. Applying ppcseq to analyse several publicly available datasets using popular tools, we show that from 3 to 10 percent of differentially abundant transcripts across algorithms and datasets had statistics inflated by the presence of outliers.

Usage

data(counts)
data(counts)

Value

See documentation

References

Mangiola S, Thomas E, Modrak M, Vehtari A, Papenfuss A (2021). “Probabilistic outlier identification for RNA sequencing generalized linear models.” NAR Genomics and Bioinformatics_, 3(1), lqab005. <URL: https://doi.org/10.1093/nargab/lqab005>.

counts

Description

Contains an example dataset for ppcseq, including RNA sequencing

Usage

counts
counts

Format

An object of class tbl_df (inherits from tbl, data.frame) with 394821 rows and 9 columns.

identify_outliers main

Description

This function runs the data modeling and statistical test for the hypothesis that a transcript includes outlier biological replicate.

Usage

identify_outliers(
  .data,
  formula = ~1,
  .sample,
  .transcript,
  .abundance,
  .significance,
  .do_check,
  .scaling_factor = NULL,
  percent_false_positive_genes = 1,
  how_many_negative_controls = 500,
  approximate_posterior_inference = TRUE,
  approximate_posterior_analysis = TRUE,
  draws_after_tail = 10,
  save_generated_quantities = FALSE,
  additional_parameters_to_save = c(),
  cores = detect_cores(),
  pass_fit = FALSE,
  do_check_only_on_detrimental = length(parse_formula(formula)) > 0,
  tol_rel_obj = 0.01,
  just_discovery = FALSE,
  seed = sample(seq_len(length.out = 999999), size = 1),
  adj_prob_theshold_2 = NULL
)
identify_outliers(
  .data,
  formula = ~1,
  .sample,
  .transcript,
  .abundance,
  .significance,
  .do_check,
  .scaling_factor = NULL,
  percent_false_positive_genes = 1,
  how_many_negative_controls = 500,
  approximate_posterior_inference = TRUE,
  approximate_posterior_analysis = TRUE,
  draws_after_tail = 10,
  save_generated_quantities = FALSE,
  additional_parameters_to_save = c(),
  cores = detect_cores(),
  pass_fit = FALSE,
  do_check_only_on_detrimental = length(parse_formula(formula)) > 0,
  tol_rel_obj = 0.01,
  just_discovery = FALSE,
  seed = sample(seq_len(length.out = 999999), size = 1),
  adj_prob_theshold_2 = NULL
)

Arguments

`.data`	A tibble including a transcript name column \| sample name column \| read counts column \| covariate columns \| Pvalue column \| a significance column
`formula`	A formula. The sample formula used to perform the differential transcript abundance analysis
`.sample`	A column name as symbol. The sample identifier
`.transcript`	A column name as symbol. The transcript identifier
`.abundance`	A column name as symbol. The transcript abundance (read count)
`.significance`	A column name as symbol. A column with the Pvalue, or other significance measure (preferred Pvalue over false discovery rate)
`.do_check`	A column name as symbol. A column with a boolean indicating whether a transcript was identified as differentially abundant
`.scaling_factor`	In case the scaling factor must not be calculated (TMM method) using the input data but provided. It is useful, for example, for pseudobulk single-cell where the scaling might depend on sample sequencing depth for all cells rather than a particular cell type.
`percent_false_positive_genes`	A real between 0 and 100. It is the aimed percent of transcript being a false positive. For example, percent_false_positive_genes = 1 provide 1 percent of the calls for outlier containing transcripts that has actually not outliers.
`how_many_negative_controls`	An integer. How many transcript from the bottom non-significant should be taken for inferring the mean-overdispersion trend.
`approximate_posterior_inference`	A boolean. Whether the inference of the joint posterior distribution should be approximated with variational Bayes It confers execution time advantage.
`approximate_posterior_analysis`	A boolean. Whether the calculation of the credible intervals should be done semi-analytically, rather than with pure sampling from the posterior. It confers execution time and memory advantage.
`draws_after_tail`	An integer. How many draws should on average be after the tail, in a way to inform CI.
`save_generated_quantities`	A boolean. Used for development and testing purposes
`additional_parameters_to_save`	A character vector. Used for development and testing purposes
`cores`	An integer. How many cored to be used with parallel calculations.
`pass_fit`	A boolean. Used for development and testing purposes
`do_check_only_on_detrimental`	A boolean. Whether to test only for detrimental outliers (same direction as the fold change). It allows to test for less transcript/sample pairs and therefore higher the probability threshold.
`tol_rel_obj`	A real. Used for development and testing purposes
`just_discovery`	A boolean. Used for development and testing purposes
`seed`	An integer. Used for development and testing purposes
`adj_prob_theshold_2`	A boolean. Used for development and testing purposes

Value

A nested tibble tbl with transcript-wise information: sample_wise_data | plot | ⁠ppc samples failed⁠ | ⁠tot deleterious_outliers⁠

Examples


library(dplyr)

data("counts")

if(Sys.info()[['sysname']] == "Linux")
result =
  counts %>%
  dplyr::mutate(  is_significant = ifelse(symbol %in% c("SLC16A12", "CYP1A1", "ART3"), TRUE, FALSE) ) %>%
 ppcseq::identify_outliers(
	formula = ~ Label,
	sample, symbol, value,
	.significance = PValue,
	.do_check  = is_significant,
	percent_false_positive_genes = 1,
	tol_rel_obj = 0.01,
	approximate_posterior_inference =TRUE,
	approximate_posterior_analysis =TRUE,
	how_many_negative_controls = 50,
	cores=1
)

library(dplyr)

data("counts")

if(Sys.info()[['sysname']] == "Linux")
result =
  counts %>%
  dplyr::mutate(  is_significant = ifelse(symbol %in% c("SLC16A12", "CYP1A1", "ART3"), TRUE, FALSE) ) %>%
 ppcseq::identify_outliers(
	formula = ~ Label,
	sample, symbol, value,
	.significance = PValue,
	.do_check  = is_significant,
	percent_false_positive_genes = 1,
	tol_rel_obj = 0.01,
	approximate_posterior_inference =TRUE,
	approximate_posterior_analysis =TRUE,
	how_many_negative_controls = 50,
	cores=1
)

plot_credible interval for theoretical data distributions

Description

Plot the data along the theoretical data distribution.

Usage

plot_credible_intervals(.data)
plot_credible_intervals(.data)

Arguments

.data

The tibble returned by identify_outliers

Value

A tibble with an additional plot column

Examples


library(dplyr)

data("counts")

if(Sys.info()[['sysname']] == "Linux"){
result =
  counts %>%
  dplyr::mutate(  is_significant = ifelse(symbol %in% c("SLC16A12", "CYP1A1", "ART3"), TRUE, FALSE) ) %>%
 ppcseq::identify_outliers(
	formula = ~ Label,
	sample, symbol, value,
	.significance = PValue,
	.do_check  = is_significant,
	percent_false_positive_genes = 1,
	tol_rel_obj = 0.01,
	approximate_posterior_inference =TRUE,
	approximate_posterior_analysis =TRUE,
	how_many_negative_controls = 50,
	cores=1
)

result_plot = result %>% plot_credible_intervals()
}

library(dplyr)

data("counts")

if(Sys.info()[['sysname']] == "Linux"){
result =
  counts %>%
  dplyr::mutate(  is_significant = ifelse(symbol %in% c("SLC16A12", "CYP1A1", "ART3"), TRUE, FALSE) ) %>%
 ppcseq::identify_outliers(
	formula = ~ Label,
	sample, symbol, value,
	.significance = PValue,
	.do_check  = is_significant,
	percent_false_positive_genes = 1,
	tol_rel_obj = 0.01,
	approximate_posterior_inference =TRUE,
	approximate_posterior_analysis =TRUE,
	how_many_negative_controls = 50,
	cores=1
)

result_plot = result %>% plot_credible_intervals()
}

Package 'ppcseq'

Help Index

The 'ppcseq' package.

Description

Usage

Value

References

counts

Description

Usage

Format

identify_outliers main

Description

Usage

Arguments

Value

Examples

plot_credible interval for theoretical data distributions

Description

Usage

Arguments

Value

Examples