Package 'fishpond'

Title:	Fishpond: downstream methods and tools for expression data
Description:	Fishpond contains methods for differential transcript and gene expression analysis of RNA-seq data using inferential replicates for uncertainty of abundance quantification, as generated by Gibbs sampling or bootstrap sampling. Also the package contains a number of utilities for working with Salmon and Alevin quantification files.
Authors:	Anqi Zhu [aut, ctb], Michael Love [aut, cre], Avi Srivastava [aut, ctb], Rob Patro [aut, ctb], Joseph Ibrahim [aut, ctb], Hirak Sarkar [ctb], Euphy Wu [ctb], Noor Pratap Singh [ctb], Scott Van Buren [ctb], Dongze He [ctb], Steve Lianoglou [ctb], Wes Wilson [ctb], Jeroen Gilis [ctb]
Maintainer:	Michael Love <[email protected]>
License:	GPL-2
Version:	2.13.0
Built:	2025-03-29 05:54:29 UTC
Source:	https://github.com/bioc/fishpond

Help Index

Fishpond: downstream methods and tools for expression data
Read statistics and nulls from CSV file
Construct a sparse matrix of transcript compatibility counts from alevin output
Compute inferential relative variance (InfRV)
deswish: DESeq2-apeglm With Inferential Samples Helps
Obtain a trace of inferential replicates for a sample
Import allelic counts as a SummarizedExperiment
Create isoform proportions from scaled data
Label rows to keep based on minimal count
Load in data from alevin-fry USA mode
Make pseudo-inferential replicates from mean and variance
Make simulated data for swish for examples/testing
Make a GRanges linking transcripts to TSS within gene
Helper function for distributing Swish on a subset of data
Plot allelic counts in a gene context using Gviz
Plot allelic ratio heatmap
Plot inferential replicates for a gene or transcript
MA plot - log fold change over average counts
Construct a sparse matrix of transcript compatibility counts from salmon output
Scale inferential replicate counts
Function for splitting SummarizedExperiment into separate RDS files
Swish method: differential expression accounting for inferential uncertainty

Fishpond: downstream methods and tools for expression data

Description

This package provides statistical methods and other tools for working with Salmon and Alevin quantification of RNA-seq data. Fishpond contains the Swish non-parametric method for detecting differential transcript expression (DTE). Swish can also be used to detect differential gene expresion (DGE), to perform allelic analysis, or to assess changes in isoform proportions.

Details

The main Swish functions are:

scaleInfReps - scaling transcript or gene expression data
labelKeep - labelling which features have sufficient counts
swish - perform non-parametric differential analysis
Plots, e.g., plotMASwish, plotInfReps

All software-related questions should be posted to the Bioconductor Support Site:

https://support.bioconductor.org

The code can be viewed at the GitHub repository, which also lists the contributor code of conduct:

https://github.com/mikelove/fishpond

References

Swish method:

Zhu, A., Srivastava, A., Ibrahim, J.G., Patro, R., Love, M.I. (2019) Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Research. https://doi.org/10.1093/nar/gkz622

Compression, makeInfReps and splitSwish:

Van Buren, S., Sarkar, H., Srivastava, A., Rashid, N.U., Patro, R., Love, M.I. (2020) Compression of quantification uncertainty for scRNA-seq counts. bioRxiv. https://doi.org/10.1101/2020.07.06.189639

Read statistics and nulls from CSV file

Description

After running splitSwish and the associated Snakefile, this function can be used to gather and add the results to the original object. See the alevin section of the vignette for an example.

Usage

addStatsFromCSV(y = NULL, infile, estPi0 = FALSE)
addStatsFromCSV(y = NULL, infile, estPi0 = FALSE)

Arguments

`y`	a SummarizedExperiment (if NULL, function will output a data.frame)
`infile`	character, path to the `summary.csv` file
`estPi0`	logical, see `swish`

Value

the SummarizedExperiment with metadata columns added, or if y is NULL, a data.frame of compiled results

Construct a sparse matrix of transcript compatibility counts from alevin output

Description

Constructs a UMI count matrix with equivalence class identifiers in the rows and barcode identifiers in the columns. The count matrix is generated from one or multiple 'bfh.txt' files that have been created by running alevin-fry with the –dumpBFH flag. Alevin-fry - https://doi.org/10.1186/s13059-019-1670-y

Usage

alevinEC(
  paths,
  tx2gene,
  multigene = FALSE,
  ignoreTxVersion = FALSE,
  ignoreAfterBar = FALSE,
  quiet = FALSE
)
alevinEC(
  paths,
  tx2gene,
  multigene = FALSE,
  ignoreTxVersion = FALSE,
  ignoreAfterBar = FALSE,
  quiet = FALSE
)

Arguments

`paths`	'Charachter' or 'character vector', path specifying the location of the 'bfh.txt' files generated with alevin-fry.
`tx2gene`	A 'dataframe' linking transcript identifiers to their corresponding gene identifiers. Transcript identifiers must be in a column 'isoform_id'. Corresponding gene identifiers must be in a column 'gene_id'.
`multigene`	'Logical', should equivalence classes that are compatible with multiple genes be retained? Default is 'FALSE', removing such ambiguous equivalence classes.
`ignoreTxVersion`	logical, whether to split the isoform id on the '.' character to remove version information to facilitate matching with the isoform id in 'tx2gene' (default FALSE).
`ignoreAfterBar`	logical, whether to split the isoform id on the '\|' character to facilitate matching with the isoform id in 'tx2gene' (default FALSE).
`quiet`	'Logical', set 'TRUE' to avoid displaying messages.

Value

A list with two elements. The first element 'counts' is a sparse count matrix with equivalence class identifiers in the rows and barcode identifiers followed by an underscore and a sample identifier in the columns. The second element 'tx2gene_matched' allows for linking the equivalence class identifiers to their respective transcripts and genes.

Details

The resulting count matrix uses equivalence class identifiers as rownames. These can be linked to respective transcripts and genes using the 'tx2gene_matched' element of the output. Specifically, if the equivalence class identifier reads 1|2|8, then the equivalence class is compatible with the transcripts and their respective genes in rows 1, 2 and 8 of 'tx2gene_matched'.

Author(s)

Jeroen Gilis

Compute inferential relative variance (InfRV)

Description

InfRV is a useful quantity for comparing groups of features (transcripts, genes, etc.) by inferential uncertainty. This function provides computation of the mean InfRV over samples, per feature, stored in mcols(y)$meanInfRV.

Usage

computeInfRV(y, pc = 5, shift = 0.01, meanVariance, useCounts = FALSE)
computeInfRV(y, pc = 5, shift = 0.01, meanVariance, useCounts = FALSE)

Arguments

`y`	a SummarizedExperiment
`pc`	a pseudocount parameter for the denominator
`shift`	a final shift parameter
`meanVariance`	logical, use pre-computed inferential mean and variance assays instead of `counts` and computed variance from `infReps`. If missing, will use pre-computed mean and variance when present
`useCounts`	logical, whether to use the MLE count matrix for the mean instead of mean of inferential replicates. this argument is for backwards compatability, as previous versions used counts. Default is FALSE

Details

InfRV is defined in Zhu et al. (2019) as: $\max(s^2 - \mu, 0) / \mu$ , using the inferential sample variance and sample mean. This formulation takes the non-Poisson part of the inferential variance and scales by the mean, which effectively stabilizes inferential uncertainty over mean count. In practice, we also add pc to the denominator and shift to the final quantity, to facilitate visualization.

This function also computes and adds the mean and variance of inferential replicates, which can be useful ahead of plotInfReps. Note that InfRV is not used in the swish statistical method (for generating test statistics, p-values or q-values), it is just for visualization.

Value

a SummarizedExperiment with meanInfRV in the metadata columns

References

Anqi Zhu, Avi Srivastava, Joseph G Ibrahim, Rob Patro, Michael I Love "Nonparametric expression analysis using inferential replicate counts" Nucleic Acids Research (2019). https://doi.org/10.1093/nar/gkz622

deswish: DESeq2-apeglm With Inferential Samples Helps

Description

The DESeq2-apeglm With Inferential Samples implementation supposes a hierarchical distribution of log2 fold changes. The final posterior standard deviation is calculated by adding the posterior variance from modeling biological replicates computed by apeglm, and the observed variance on the posterior mode over inferential replicates. This function requires the DESeq2 and apeglm packages to be installed and will print an error if they are not found.

Usage

deswish(y, x, coef)
deswish(y, x, coef)

Arguments

`y`	a SummarizedExperiment containing the inferential replicate matrices, as output by `tximeta`, and then with `labelKeep` applied. One does not need to run `scaleInfReps` as scaling is done internally via DESeq2.
`x`	the design matrix
`coef`	the coefficient to test (see `lfcShrink`)

Value

a SummarizedExperiment with metadata columns added: the log2 fold change and posterior SD using inferential replicates, and the original log2 fold change (apeglm) and its posterior SD

References

The DESeq and lfcShrink function in the DESeq2 package:

Zhu, Ibrahim, Love "Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences" Bioinformatics (2018).

Love, Huber, Anders "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2" Genome Biology (2014).

Examples


# a small example... 500 genes, 10 inf reps
y <- makeSimSwishData(m=500, numReps=10)
y <- labelKeep(y)
#y <- deswish(y, ~condition, "condition_2_vs_1")

# a small example... 500 genes, 10 inf reps
y <- makeSimSwishData(m=500, numReps=10)
y <- labelKeep(y)
#y <- deswish(y, ~condition, "condition_2_vs_1")

Obtain a trace of inferential replicates for a sample

Description

Simple helper function to obtain a trace (e.g. MCMC trace) of the ordered inferential replicates for one samples. Supports either multiple features, idx, or multiple samples, samp_idx (not both). Returns a tidy data.frame for easy plotting.

Usage

getTrace(y, idx, samp_idx)
getTrace(y, idx, samp_idx)

Arguments

`y`	a SummarizedExperiment with inferential replicates as assays `infRep1` etc.
`idx`	the names or row numbers of the gene or transcript to plot
`samp_idx`	the names or column numbers of the samples to plot

Value

a data.frame with the counts along the interential replicates, possible with additional columns specifying feature or sample

Examples


y <- makeSimSwishData()
getTrace(y, "gene-1", "s1")

y <- makeSimSwishData()
getTrace(y, "gene-1", "s1")

Import allelic counts as a SummarizedExperiment

Description

Read in Salmon quantification of allelic counts from a diploid transcriptome. Assumes that diploid transcripts are marked with the following suffix: an underscore and a consistent symbol for each of the two alleles, e.g. ENST123_M and ENST123_P, or ENST123_alt and ENST123_ref, etc. importAllelicCounts requires the tximeta package. Further information in Details below.

Usage

importAllelicCounts(
  coldata,
  a1,
  a2,
  format = c("wide", "assays"),
  tx2gene = NULL,
  ...
)
importAllelicCounts(
  coldata,
  a1,
  a2,
  format = c("wide", "assays"),
  tx2gene = NULL,
  ...
)

Arguments

`coldata`	a data.frame as used in `tximeta`
`a1`	the symbol for the effect allele
`a2`	the symbol for the non-effect allele
`format`	either `"wide"` or `"assays"` for whether to combine the allelic counts as columns (wide) or put the allelic count information in different assay slots (assays). For wide output, the data for the non-effect allele (a2) comes first, then the effect allele (a1), e.g. `[a2 \| a1]`. The `ref` level of the factor variable `se$allele` will be `"a2"` (so by default comparisons will be: a1 vs a2). For assays output, all of the original matrices are renamed with a prefix, either `a1-` or `a2-`.
`tx2gene`	optional, a data.frame with first column indicating transcripts, second column indicating genes (or any other transcript grouping). Alternatively, this can be a GRanges object with required columns `tx_id`, and `group_id` (see `makeTx2Tss`). For more information on this argument, see Details.
`...`	any arguments to pass to tximeta

Details

Requirements - There must be exactly two alleles for each transcript, and the --keep-duplicates option should be used in Salmon indexing to avoid removal of transcripts with identical sequence. The output object has half the number of transcripts, with the two alleles either stored in a "wide" object, or as re-named "assays". Note carefully that the symbol provided to a1 is used as the effect allele, and a2 is used as the non-effect allele (see the format argument description and Value description below).

tx2gene - The two columns should include the a1 and a2 suffix for the transcripts and genes/groups, or those will be added internally, if it is detected that the first transcript does not have these suffices. For example if _alt or _ref, or _M or _P (as indicated by the a1 and a2 arguments) are not present in the table, the table rows will be duplicated with those suffices added on behalf of the user. If tx2gene is not provided, the output object will be transcript-level. Do not attempt to set the txOut argument, it will conflict with internal calls to downstream functions. If the a1/a2 suffices are not at the end of the transcript name in the quantification files, e.g. ENST123_M|<metadata>, then ignoreAfterBar=TRUE can be used to match regardless of the string following | in the quantification files.

skipMeta=TRUE is used, as it is assumed the diploid transcriptome does not match any reference transcript collection. This may change in future iterations of the function, depending on developments in upstream annotations and software.

If tx2gene is a GRanges object, the rowRanges of the output will be the reduced ranges of the grouped input ranges, with tx_id collapsed into a CharacterList, and TSS positions saved as an IntegerList, if these are not equal among the transcripts of a group. Other metadata columns are not manipulated, just the metadata for the first range is returned.

Value

a SummarizedExperiment, with allele counts (and other data) combined into a wide matrix [a2 | a1], or as assays (a1, then a2). The original strings associated with a1 and a2 are stored in the metadata of the object, in the alleles list element. Note the reference level of se$allele will be "a2", such that comparisons by default will be a1 vs a2 (effect vs non-effect).

References

Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) https://doi.org/10.1101/2022.08.12.503785

Create isoform proportions from scaled data

Description

Takes output of scaled (and optionally filtered) counts and returns isoform proportions by dividing out the total scaled count for the gene for each sample. The operation is performed on the counts assay, then creating a new assay called isoProp, and on all of the inferential replicates, turning them from counts into isoform proportions. Any transcripts (rows) from single isoform genes are removed, and the transcripts will be re-ordered by gene ID.

Usage

isoformProportions(y, geneCol = "gene_id", quiet = FALSE)
isoformProportions(y, geneCol = "gene_id", quiet = FALSE)

Arguments

`y`	a SummarizedExperiment
`geneCol`	the name of the gene ID column in the metadata columns for the rows of `y`
`quiet`	display no messages

Value

a SummarizedExperiment, with single-isoform transcripts removed, and transcripts now ordered by gene

Label rows to keep based on minimal count

Description

Adds a column keep to mcols(y) that specifies which rows of the SummarizedExperiment will be included in statistical testing. Rows are not removed, just marked with the logical keep.

Usage

labelKeep(y, minCount = 10, minN = 3, x)
labelKeep(y, minCount = 10, minN = 3, x)

Arguments

`y`	a SummarizedExperiment
`minCount`	the minimum count
`minN`	the minimum sample size at `minCount`
`x`	the name of the condition variable, will use the smaller of the two groups to set `minN`. Similar to edgeR's `filterByExpr`, as the smaller group grows past 10, `minN` grows only by 0.7 increments of sample size

Value

a SummarizedExperiment with a new column keep in mcols(y)

Examples


y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)

y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)

Load in data from alevin-fry USA mode

Description

Enables easy loading of sparse data matrices provided by alevin-fry USA mode.

Usage

loadFry(fryDir, outputFormat = "scRNA", nonzero = FALSE, quiet = FALSE)
loadFry(fryDir, outputFormat = "scRNA", nonzero = FALSE, quiet = FALSE)

Arguments

`fryDir`	path to the output directory returned by alevin-fry quant command. This directory should contain a `metainfo.json`, and an alevin folder which contains `quants_mat.mtx`, `quants_mat_cols.txt` and `quants_mat_rows.txt`
`outputFormat`	can be either be a list that defines the desired format of the output `SingleCellExperiment` object or a string that represents one of the pre-defined output formats, which are "scRNA", "snRNA", "all", "scVelo", "velocity", "U+S+A" and "S+A". See details for the explanations of the pre-defined formats and how to define custom format.
`nonzero`	whether to filter cells with non-zero expression value across all genes (default `FALSE`). If `TRUE`, this will filter based on all assays. If a string vector of assay names, it will filter based on the matching assays in the vector. If not in USA mode, it must be TRUE/FALSE/counts.
`quiet`	logical whether to display no messages

Value

A SingleCellExperiment object that contains one or more assays. Each assay consists of a gene by cell count matrix. The row names are feature names, and the column names are cell barcodes

Details about `loadFry`

This function consumes the result folder returned by running alevin-fry quant in unspliced, spliced, ambiguous (USA) quantification mode, and returns a SingleCellExperiment object that contains a final count for each gene within each cell. In USA mode, alevin-fry quant returns a count matrix contains three types of count for each feature (gene) within each sample (cell or nucleus), which represent the spliced mRNA count of the gene (S), the unspliced mRNA count of the gene (U), and the count of UMIs whose splicing status is ambiguous for the gene (A). For each assay defined by outputFormat, these three counts of a gene within a cell will be summed to get the final count of the gene according to the rule defined in the outputFormat. The returned object will contains the desired assays defined by outputFormat, with rownames as the barcode of samples and colnames as the feature names.

Details about the output format

The outputFormat argument takes either be a list that defines the desired format of the output SingleCellExperiment object or a string that represents one of the pre-defined output format.

Currently the pre-defined formats of the output SingleCellExperiment object are:

"scRNA":: This format is recommended for single cell experiments. It returns a counts assay that contains the S+A count of each gene in each cell, and a unspliced assay that contains the U count of each gene in each cell.
"snRNA", "all" and "U+S+A":: These three formats are the same. They return a counts assay that contains the U+S+A count of each gene in each cell without any extra layers. "snRNA" is recommended for single-nucleus RNA-sequencing experiments. "raw" is recommended for mimicing CellRanger 7's behavior, which returns this format for both single-cell and single-nucleus experiments.
"S+A":: This format returns a counts assay that contains the S+A count of each gene in each cell.
"raw":: This format puts the three kinds of counts into three separate assays, which are unspliced, spliced and ambiguous.
"velocity":: This format contains two assays. The spliced assay contains the S+A count of each gene in each cell. The unspliced assay contains the U counts of each gene in each cell.
"scVelo":: This format is for direct entry into velociraptor R package or other scVelo downstream analysis pipeline for velocity analysis in R with Bioconductor. It adds the expected "S"-pliced assay and removes errors for size factors being non-positive.

A custom output format can be defined using a list. Each element in the list defines an assay in the output SingleCellExperiment object. The name of an element in the list will be the name of the corresponding assay in the output object. Each element in the list should be defined as a vector that takes at least one of the three kinds of count, which are U, S and A. See the provided toy example for defining a custom output format.

Author(s)

Dongze He, with contributions from Steve Lianoglou, Wes Wilson

References

alevin-fry publication:

He, D., Zakeri, M., Sarkar, H. et al. "Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data." Nature Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3

Examples


# Get path for minimal example avelin-fry output dir
testdat <- fishpond:::readExampleFryData("fry-usa-basic")

# This is exactly how the velocity format defined internally.
custom_velocity_format <- list("spliced"=c("S","A"), "unspliced"=c("U"))

# Load alevin-fry gene quantification in velocity format
sce <- loadFry(fryDir=testdat$parent_dir, outputFormat=custom_velocity_format)
SummarizedExperiment::assayNames(sce)

# Load the same data but use pre-defined, velociraptor R pckage desired format
scvelo_format <- "scVelo"

scev <- loadFry(fryDir=testdat$parent_dir, outputFormat=scvelo_format, nonzero=TRUE)
SummarizedExperiment::assayNames(scev)

# Get path for minimal example avelin-fry output dir
testdat <- fishpond:::readExampleFryData("fry-usa-basic")

# This is exactly how the velocity format defined internally.
custom_velocity_format <- list("spliced"=c("S","A"), "unspliced"=c("U"))

# Load alevin-fry gene quantification in velocity format
sce <- loadFry(fryDir=testdat$parent_dir, outputFormat=custom_velocity_format)
SummarizedExperiment::assayNames(sce)

# Load the same data but use pre-defined, velociraptor R pckage desired format
scvelo_format <- "scVelo"

scev <- loadFry(fryDir=testdat$parent_dir, outputFormat=scvelo_format, nonzero=TRUE)
SummarizedExperiment::assayNames(scev)

Make pseudo-inferential replicates from mean and variance

Description

Makes pseudo-inferential replicate counts from mean and variance assays. The simulated counts are drawn from a negative binomial distribution, with mu=mean and size set using a method of moments estimator for dispersion.

Usage

makeInfReps(y, numReps, minDisp = 0.001)
makeInfReps(y, numReps, minDisp = 0.001)

Arguments

`y`	a SummarizedExperiment
`numReps`	how many inferential replicates
`minDisp`	the minimal dispersion value, set after method of moments estimation from inferential mean and variance

Details

Note that these simulated counts only reflect marginal variance (one transcript or gene at a time), and do not capture the covariance of counts across transcripts or genes, unlike imported inferential replicate data. Therefore, makeInfReps should not be used with summarizeToGene to create gene-level inferential replicates if inferential replicates were originally created on the transcript level. Instead, import the original inferential replicates.

Value

a SummarizedExperiment

References

Examples


library(SummarizedExperiment)
mean <- matrix(1:4,ncol=2)
variance <- mean
se <- SummarizedExperiment(list(mean=mean, variance=variance))
se <- makeInfReps(se, numReps=50)

library(SummarizedExperiment)
mean <- matrix(1:4,ncol=2)
variance <- mean
se <- SummarizedExperiment(list(mean=mean, variance=variance))
se <- makeInfReps(se, numReps=50)

Make simulated data for swish for examples/testing

Description

Makes a small swish dataset for examples and testing. The first six genes have some differential expression evidence in the counts, with varying degree of inferential variance across inferential replicates (1-2: minor, 3-4: some, 5-6: substantial). The 7th and 8th genes have all zeros to demonstrate labelKeep.

Usage

makeSimSwishData(
  m = 1000,
  n = 10,
  numReps = 20,
  null = FALSE,
  meanVariance = FALSE,
  allelic = FALSE,
  diffAI = FALSE,
  dynamicAI = FALSE
)
makeSimSwishData(
  m = 1000,
  n = 10,
  numReps = 20,
  null = FALSE,
  meanVariance = FALSE,
  allelic = FALSE,
  diffAI = FALSE,
  dynamicAI = FALSE
)

Arguments

`m`	number of genes
`n`	number of samples
`numReps`	how many inferential replicates to generate
`null`	logical, whether to make an all null dataset
`meanVariance`	logical, whether to output only mean and variance of inferential replicates
`allelic`	logical, whether to make an allelic sim dataset
`diffAI`	logical, whether to make a differential allelic sim dataset
`dynamicAI`	logical, whether to make a dynamic allelic sim dataset

Value

a SummarizedExperiment

Examples


library(SummarizedExperiment)
y <- makeSimSwishData()
assayNames(y)

library(SummarizedExperiment)
y <- makeSimSwishData()
assayNames(y)

Make a GRanges linking transcripts to TSS within gene

Description

This helper function takes either a TxDb/EnsDb or GRanges object as input and outputs a GRanges object where transcripts are aggregated to the gene + TSS (transcription start site). For nearby TSS that should be grouped together, see maxgap.

Usage

makeTx2Tss(x, maxgap = 0)
makeTx2Tss(x, maxgap = 0)

Arguments

`x`	either TxDb/EnsDb or GRanges object. The GRanges object should have metadata columns `tx_id` and `gene_id`
`maxgap`	integer, number of basepairs to use determining whether to combine nearby TSS

Value

GRanges with columns tx_id, tss, and group_id

Examples

## Not run: 
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
t2t <- makeTx2Tss(edb)

## End(Not run)

## Not run: 
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
t2t <- makeTx2Tss(edb)

## End(Not run)

Helper function for distributing Swish on a subset of data

Description

This function is called by the Snakefile that is generated by splitSwish. See alevin example in the vignette. As such, it doesn't need to be run by users in an interactive R session.

Usage

miniSwish(
  infile,
  outfile,
  numReps = 20,
  lengthCorrect = FALSE,
  overwrite = FALSE,
  ...
)
miniSwish(
  infile,
  outfile,
  numReps = 20,
  lengthCorrect = FALSE,
  overwrite = FALSE,
  ...
)

Arguments

`infile`	path to an RDS file of a SummarizedExperiment
`outfile`	a CSV file to write out
`numReps`	how many inferential replicates to generate
`lengthCorrect`	logical, see `scaleInfReps`, and Swish vignette. As this function is primarily for alevin, the default is `FALSE`
`overwrite`	logical, whether `outfile` should overwrite an existing file
`...`	arguments passed to `swish`

Details

Note that the default for length correction is FALSE, as opposed to the default in scaleInfReps which is TRUE. The default for numReps here is 20.

Value

nothing, files are written out

Plot allelic counts in a gene context using Gviz

Description

Plot allelic data (allelic proportions, isoform propostions) in a gene context leveraging the Gviz package. See the allelic vignette for example usage. TPM and count filters are used by default to clean up the plot of features with minimal signal; note that the isoform proportion displayed at the bottom of the plot is among the features that pass the filtering steps. If the function is not responding, it is likely due to issues connecting to UCSC servers to draw the ideogram, in this case set ideogram=FALSE.

Usage

plotAllelicGene(
  y,
  gene,
  db,
  region = NULL,
  symbol = NULL,
  genome = NULL,
  tpmFilter = 1,
  isoPropFilter = 0.05,
  countFilter = 10,
  pc = 1,
  transcriptAnnotation = "symbol",
  labels = list(a2 = "a2", a1 = "a1"),
  qvalue = TRUE,
  log2FC = TRUE,
  ideogram = FALSE,
  cov = NULL,
  covFacetIsoform = FALSE,
  allelicCol = c("dodgerblue", "goldenrod1"),
  isoformCol = "firebrick",
  statCol = "black",
  gridCol = "grey80",
  baselineCol = "black",
  titleCol = "black",
  titleAxisCol = "black",
  titleBgCol = "white",
  geneBorderCol = "darkblue",
  geneFillCol = "darkblue",
  genomeAxisCol = "black",
  innerFontCol = "black",
  ...
)
plotAllelicGene(
  y,
  gene,
  db,
  region = NULL,
  symbol = NULL,
  genome = NULL,
  tpmFilter = 1,
  isoPropFilter = 0.05,
  countFilter = 10,
  pc = 1,
  transcriptAnnotation = "symbol",
  labels = list(a2 = "a2", a1 = "a1"),
  qvalue = TRUE,
  log2FC = TRUE,
  ideogram = FALSE,
  cov = NULL,
  covFacetIsoform = FALSE,
  allelicCol = c("dodgerblue", "goldenrod1"),
  isoformCol = "firebrick",
  statCol = "black",
  gridCol = "grey80",
  baselineCol = "black",
  titleCol = "black",
  titleAxisCol = "black",
  titleBgCol = "white",
  geneBorderCol = "darkblue",
  geneFillCol = "darkblue",
  genomeAxisCol = "black",
  innerFontCol = "black",
  ...
)

Arguments

`y`	a SummarizedExperiment (see `swish`)
`gene`	the name of the gene of interest, requires a column `gene_id` in the metadata columns of the rowRanges of y
`db`	either a TxDb or EnsDb object to use for the gene model
`region`	GRanges, the region to be displayed in the Gviz plot. if not specified, will be set according to the gene plus 20 of the total gene extent on either side
`symbol`	alternative to `gene`, to specify the gene of interest according to a column `symbol` in the metadata columns of the rowRanges of y
`genome`	UCSC genome code (e.g. `"hg38"`, if not specified it will use the `GenomeInfoDb::genome()` of the rowRanges of `y`
`tpmFilter`	minimum TPM value (mean over samples) to keep a feature
`isoPropFilter`	minimum percent of isoform proportion to keep a feature
`countFilter`	minimum count value (mean over samples) to keep a feature
`pc`	pseudocount to avoid dividing by zero in allelic proportion calculation
`transcriptAnnotation`	argument passed to Gviz::GeneRegionTrack (`"symbol"`, `"gene"`, `"transcript"`, etc.)
`labels`	list, labels for a2 (non-effect) and a1 (effect) alleles
`qvalue`	logical, whether to inclue qvalue track
`log2FC`	logical, whether to include log2FC track
`ideogram`	logical, whether to include ideogram track
`cov`	character specifying a factor or integer variable to use to facet the allelic proportion plots, should be a column in `colData(y)`
`covFacetIsoform`	logical, if `cov` is provided, should it also be used to facet the isoform proportion track, in addition to the allelic proportion track
`allelicCol`	the colors of the points and lines for allelic proportion
`isoformCol`	the colors of the points and lines for isoform proportion
`statCol`	the color of the lollipops for q-value and log2FC
`gridCol`	the color of the grid in the data tracks
`baselineCol`	the color of the horizontal baseline for q-value and lo2gFC
`titleCol`	font color of the side titles (track labels)
`titleAxisCol`	axis color of the side titles (track labels)
`titleBgCol`	background color of the side titles (track labels)
`geneBorderCol`	the color of the borders and font in gene region track
`geneFillCol`	the color of the fill in the gene region track
`genomeAxisCol`	line color of the genome axis track
`innerFontCol`	font color of genome axis track, ideogram, and allelic proportion legend
`...`	additional arguments passed to `Gviz::plotTracks()`

Value

nothing, a plot is displayed

References

The methods for allelic expression analysis are described in:

This function makes use of the Gviz package that is described in:

Hahne, F., Ivanek, R. (2016). Visualizing Genomic Data Using Gviz and Bioconductor. In: Mathé, E., Davis, S. (eds) Statistical Genomics. Methods in Molecular Biology, vol 1418. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3578-9_16

Plot allelic ratio heatmap

Description

Plot allelic ratio heatmap over features and samples using the pheatmap package. The a1/(a2 + a1) ratio is displayed.

Usage

plotAllelicHeatmap(
  y,
  idx,
  breaks = NULL,
  cluster_cols = FALSE,
  main = "Allelic ratio",
  stripAfterChar = "-",
  ...
)
plotAllelicHeatmap(
  y,
  idx,
  breaks = NULL,
  cluster_cols = FALSE,
  main = "Allelic ratio",
  stripAfterChar = "-",
  ...
)

Arguments

`y`	a SummarizedExperiment (see `swish`)
`idx`	a numeric or logical vector of which features to plot
`breaks`	breaks passed along to pheatmap
`cluster_cols`	logical, passed to pheatmap
`main`	title of the plot
`stripAfterChar`	for the column names, if specified will strip allelic identifiers after this character, default is hyphen. set to NULL to avoid this action
`...`	other arguments passed to pheatmap

Value

nothing, a plot is displayed

References

The methods for allelic expression analysis are described in:

This function makes use of the pheatmap package:

Kolde, Raivo. "Pheatmap: pretty heatmaps." R package version 1.2 (2012): 726.

Plot inferential replicates for a gene or transcript

Description

For datasets with inferential replicates, boxplots are drawn for the two groups and potentially grouped by covariates. For datasets with only mean and variance, points and intervals (95 approximation) are drawn. Additionally, for numeric x values, points and intervals will be drawn and computeInfRV should be run first in order to add the mean and variance statistics.

Usage

plotInfReps(
  y,
  idx,
  x,
  cov = NULL,
  colsDrk = c("dodgerblue", "goldenrod4", "royalblue4", "red3", "purple4", "darkgreen"),
  colsLgt = c("lightblue1", "goldenrod1", "royalblue1", "salmon1", "orchid1",
    "limegreen"),
  xaxis,
  xlab,
  ylim,
  main,
  mainCol,
  legend = FALSE,
  legendPos = "topleft",
  legendTitle = FALSE,
  legendCex = 1,
  useMean = TRUE,
  q = qnorm(0.975),
  applySF = FALSE,
  reorder,
  thin,
  shiftX
)
plotInfReps(
  y,
  idx,
  x,
  cov = NULL,
  colsDrk = c("dodgerblue", "goldenrod4", "royalblue4", "red3", "purple4", "darkgreen"),
  colsLgt = c("lightblue1", "goldenrod1", "royalblue1", "salmon1", "orchid1",
    "limegreen"),
  xaxis,
  xlab,
  ylim,
  main,
  mainCol,
  legend = FALSE,
  legendPos = "topleft",
  legendTitle = FALSE,
  legendCex = 1,
  useMean = TRUE,
  q = qnorm(0.975),
  applySF = FALSE,
  reorder,
  thin,
  shiftX
)

Arguments

`y`	a SummarizedExperiment (see `swish`)
`idx`	the name or row number of the gene or transcript
`x`	the name of the condition variable for splitting and coloring the samples or cells. Also can be a numeric, e.g. pseudotime, in which case, `cov` can be used to designate groups for coloring
`cov`	the name of the covariate for adjustment
`colsDrk`	dark colors for the lines of the boxes
`colsLgt`	light colors for the inside of the boxes
`xaxis`	logical, whether to label the sample numbers. default is `TRUE` if there are less than 30 samples
`xlab`	the x-axis label
`ylim`	y limits
`main`	title
`mainCol`	name of metadata column to use for title (instead of rowname)
`legend`	logical, show simple legend (default FALSE)
`legendPos`	character, position of the legend (default "topleft")
`legendTitle`	logical, whether to add the name of the grouping variable as a title on the legend (default FALSE)
`legendCex`	numeric, size of the legend (default 1)
`useMean`	logical, when inferential replicates are not present or when `x` is continuous, whether to use the `mean` assay or the `counts` assay for plotting
`q`	numeric, the quantile to use when plotting the intervals when inferential replicates are not present or when `x` is continuous. Default is `qnorm(.975) ~= 1.96` corresponding to 95 intervals
`applySF`	logical, when inferential replicates are not present, should `y$sizeFactor` be divided out from the mean and interval plots (default FALSE)
`reorder`	logical, should points within a group defined by condition and covariate be re-ordered by their count value (default is FALSE, except for alevin data)
`thin`	integer, should the mean and interval lines be drawn thin (the default switches from 0 [not thin] to 1 [thinner] at n=150 cells, and from 1 [thinner] to 2 [thinnest] at n=400 cells)
`shiftX`	when `x` is continuous and `cov` is provided, the amount to shift the values on the x-axis to improve visibility of the point and line ranges (will be subtracted from the first level of `cov` and added to the second level of `cov`)

Value

nothing, a plot is displayed

Examples


y <- makeSimSwishData()
plotInfReps(y, 3, "condition")

y <- makeSimSwishData(n=40)
y$batch <- factor(rep(c(1,2,3,1,2,3),c(5,10,5,5,10,5)))
plotInfReps(y, 3, "condition", "batch")

y <- makeSimSwishData()
plotInfReps(y, 3, "condition")

y <- makeSimSwishData(n=40)
y$batch <- factor(rep(c(1,2,3,1,2,3),c(5,10,5,5,10,5)))
plotInfReps(y, 3, "condition", "batch")

MA plot - log fold change over average counts

Description

MA plot - log fold change over average counts

Usage

plotMASwish(y, alpha = 0.05, sigcolor = "blue", ...)
plotMASwish(y, alpha = 0.05, sigcolor = "blue", ...)

Arguments

`y`	a SummarizedExperiment (see `swish`)
`alpha`	the FDR threshold for coloring points
`sigcolor`	the color for the significant points
`...`	passed to plot

Value

nothing, a plot is displayed

Examples


y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- swish(y, x="condition")
plotMASwish(y)

y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- swish(y, x="condition")
plotMASwish(y)

Construct a sparse matrix of transcript compatibility counts from salmon output

Description

Constructs a count matrix with equivalence class identifiers in the rows. The count matrix is generated from one or multiple 'eq_classes.txt' files that have been created by running salmon with the –dumpEq flag. Salmon - https://doi.org/10.1038/nmeth.4197

Usage

salmonEC(
  paths,
  tx2gene,
  multigene = FALSE,
  ignoreTxVersion = FALSE,
  ignoreAfterBar = FALSE,
  quiet = FALSE
)
salmonEC(
  paths,
  tx2gene,
  multigene = FALSE,
  ignoreTxVersion = FALSE,
  ignoreAfterBar = FALSE,
  quiet = FALSE
)

Arguments

`paths`	'Charachter' or 'character vector', path specifying the location of the 'eq_classes.txt' files generated with salmon.
`tx2gene`	A 'dataframe' linking transcript identifiers to their corresponding gene identifiers. Transcript identifiers must be in a column 'isoform_id'. Corresponding gene identifiers must be in a column 'gene_id'.
`multigene`	'Logical', should equivalence classes that are compatible with multiple genes be retained? Default is 'FALSE', removing such ambiguous equivalence classes.
`ignoreTxVersion`	logical, whether to split the isoform id on the '.' character to remove version information to facilitate matching with the isoform id in 'tx2gene' (default FALSE).
`ignoreAfterBar`	logical, whether to split the isoform id on the '\|' character to facilitate matching with the isoform id in 'tx2gene' (default FALSE).
`quiet`	'Logical', set 'TRUE' to avoid displaying messages.

Value

A list with two elements. The first element 'counts' is a sparse count matrix with equivalence class identifiers in the rows. If multiple paths are specified, the columns are in the same order as the paths. The second element 'tx2gene_matched' allows for linking those identifiers to their respective transcripts and genes.

Details

Author(s)

Jeroen Gilis

Scale inferential replicate counts

Description

A helper function to scale the inferential replicates to the mean sequencing depth. The scaling takes into account a robust estimator of size factor (median ratio method is used). First, counts are corrected per row using the effective lengths (for gene counts, the average transcript lengths), then scaled per column to the geometric mean sequence depth, and finally are adjusted per-column up or down by the median ratio size factor to minimize systematic differences across samples.

Usage

scaleInfReps(
  y,
  lengthCorrect = TRUE,
  meanDepth = NULL,
  sfFun = NULL,
  minCount = 10,
  minN = 3,
  saveMeanScaled = FALSE,
  quiet = FALSE
)
scaleInfReps(
  y,
  lengthCorrect = TRUE,
  meanDepth = NULL,
  sfFun = NULL,
  minCount = 10,
  minN = 3,
  saveMeanScaled = FALSE,
  quiet = FALSE
)

Arguments

`y`	a SummarizedExperiment with: `infReps` a list of inferential replicate count matrices, `counts` the estimated counts matrix, and `length` the effective lengths matrix
`lengthCorrect`	whether to use effective length correction (default is TRUE)
`meanDepth`	(optional) user can specify a different mean sequencing depth. By default the geometric mean sequencing depth is computed
`sfFun`	(optional) size factors function. An alternative to the median ratio can be provided here to adjust the scaledTPM so as to remove remaining library size differences. Alternatively, one can provide a numeric vector of size factors
`minCount`	for internal filtering, the minimum count
`minN`	for internal filtering, the minimum sample size at `minCount`
`saveMeanScaled`	store the mean of scaled inferential replicates as an assay 'meanScaled'
`quiet`	display no messages

Value

a SummarizedExperiment with the inferential replicates as scaledTPM with library size already corrected (no need for further normalization). A column log10mean is also added which is the log10 of the mean of scaled counts across all samples and all inferential replicates.

Examples


y <- makeSimSwishData()
y <- scaleInfReps(y)

y <- makeSimSwishData()
y <- scaleInfReps(y)

Function for splitting SummarizedExperiment into separate RDS files

Description

The splitSwish function splits up the y object along genes and writes a Snakefile that can be used with Snakemake to distribute running swish across genes. This workflow is primarily designed for large single cell datasets, and so the default is to not perform length correction within the distributed jobs. See the alevin section of the vignette for an example. See the Snakemake documention for details on how to run and customize a Snakefile: https://snakemake.readthedocs.io

Usage

splitSwish(y, nsplits, prefix = "swish", snakefile = NULL, overwrite = FALSE)
splitSwish(y, nsplits, prefix = "swish", snakefile = NULL, overwrite = FALSE)

Arguments

`y`	a SummarizedExperiment
`nsplits`	integer, how many pieces to break `y` into
`prefix`	character, the path of the RDS files to write out, e.g. `prefix="/path/to/swish"` will generate `swish.rds` files at this path
`snakefile`	character, the path of a Snakemake file, e.g. `Snakefile`, that should be written out. If `NULL`, then no `Snakefile` is written out
`overwrite`	logical, whether the `snakefile` and RDS files (`swish1.rds`, ...) should overwrite existing files

Value

nothing, files are written out

References

Compression and splitting across jobs:

Snakemake:

Koster, J., Rahmann, S. (2012) Snakemake - a scalable bioinformatics workflow engine. Bioinformatics. https://doi.org/10.1093/bioinformatics/bts480

Swish method: differential expression accounting for inferential uncertainty

Description

The Swish method, or "SAMseq With Inferential Samples Helps". Performs non-parametric inference on rows of y for various experimental designs. See References for details.

Usage

swish(
  y,
  x,
  cov = NULL,
  pair = NULL,
  interaction = FALSE,
  cor = c("none", "spearman", "pearson"),
  nperms = 100,
  estPi0 = FALSE,
  qvaluePkg = "qvalue",
  pc = 5,
  nRandomPairs = 30,
  fast = NULL,
  returnNulls = FALSE,
  quiet = FALSE
)
swish(
  y,
  x,
  cov = NULL,
  pair = NULL,
  interaction = FALSE,
  cor = c("none", "spearman", "pearson"),
  nperms = 100,
  estPi0 = FALSE,
  qvaluePkg = "qvalue",
  pc = 5,
  nRandomPairs = 30,
  fast = NULL,
  returnNulls = FALSE,
  quiet = FALSE
)

Arguments

`y`	a SummarizedExperiment containing the inferential replicate matrices of median-ratio-scaled TPM as assays 'infRep1', 'infRep2', etc.
`x`	the name of the condition variable. A factor with two levels for a two group analysis (possible to adjust for covariate or matched samples, see next two arguments). The log fold change is computed as non-reference level over reference level (see vignette: 'Note on factor levels')
`cov`	the name of the covariate for adjustment. If provided a stratified Wilcoxon in performed. Cannot be used with `pair`, unless using either `interaction` or `cor`
`pair`	the name of the pair variable, which should be the number of the pair. Can be an integer or factor. If specified, a signed rank test is used to build the statistic by default. Note: For simple paired designs, see use of `fast=1` for a much faster implementation of paired testing using a one-sample z-score test statistic. All samples across `x` must be pairs if this argument is specified. Cannot be used with `cov`, unless using either `interaction` or `cor`
`interaction`	logical, whether to perform a test of an interaction between `x` and `cov`. Can use `pair` or not. See Details.
`cor`	character, whether to compute correlation of `x` with the log counts, and signifance testing on the correlation as a test statistic. Either `"spearman"` or `"pearson"` correlations can be computed. For Spearman the correlation is computed over ranks of `x` and ranks of inferential replicates. For Pearson, the correlation is computed for `x` and log2 of the inferential replicates plus `pc`. Default is `"none"`, e.g. two-group comparison using the rank sum test or other alternatives listed above. Additionally, correlation can be computed between a continuous variable `cov` and log fold changes across `x` matched by `pair`
`nperms`	the number of permutations. if set above the possible number of permutations, the function will print a message that the value is set to the maximum number of permutations possible
`estPi0`	logical, whether to estimate pi0
`qvaluePkg`	character, which package to use for q-value estimation, `samr` or `qvalue`
`pc`	pseudocount for finite estimation of `log2FC`, not used in calculation of test statistics, `locfdr` or `qvalue`
`nRandomPairs`	the number of random pseudo-pairs (only used with `interaction=TRUE` and un-matched samples) to use to calculate the test statistic
`fast`	an integer (0 or 1), toggles different methods based on speed, currently only relevant for simple paired analysis. For simple paired design, `fast=1` triggers the use of a one-sample z-score instead of a signed rank statistic. The one-sample z-score is much faster (can be >10x faster), by avoiding the expensive re-computation of ranks during permutations. `fast=1` is not relevant for `interaction` or `cor` type designs
`returnNulls`	logical, only return the `stat` vector, the `log2FC` vector, and the `nulls` matrix (default FALSE)
`quiet`	display no messages

Details

interaction: The interaction tests are different than the other tests produced by swish, in that they focus on a difference in the log2 fold change across levels of x when comparing the two levels in cov. If pair is specified, this will perform a Wilcoxon rank sum test on the two groups of matched sample LFCs. If pair is not included, multiple random pairs of samples within the two groups are chosen, and again a Wilcoxon rank sum test compared the LFCs across groups.

Value

a SummarizedExperiment with metadata columns added: the statistic (either a centered Wilcoxon Mann-Whitney or a signed rank statistic, aggregated over inferential replicates), a log2 fold change (the median over inferential replicates, and averaged over pairs or groups (if groups, weighted by sample size), the local FDR and q-value, as estimated by the samr package.

References

The citation for swish method is:

The swish method builds upon the SAMseq method, and extends it by incorporating inferential uncertainty, as well as providing methods for additional experimental designs (see vignette).

For reference, the publication describing the SAMseq method is:

Jun Li and Robert Tibshirani "Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data" Stat Methods Med Res (2013). https://doi.org/10.1177/0962280211428386

Examples


library(SummarizedExperiment)
set.seed(1)
y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- swish(y, x="condition")

# histogram of the swish statistics
hist(mcols(y)$stat, breaks=40, col="grey")
cols = rep(c("blue","purple","red"),each=2)
for (i in 1:6) {
  arrows(mcols(y)$stat[i], 20,
         mcols(y)$stat[i], 10,
         col=cols[i], length=.1, lwd=2)
}

# plot inferential replicates
plotInfReps(y, 1, "condition")
plotInfReps(y, 3, "condition")
plotInfReps(y, 5, "condition")

library(SummarizedExperiment)
set.seed(1)
y <- makeSimSwishData()
y <- scaleInfReps(y)
y <- labelKeep(y)
y <- swish(y, x="condition")

# histogram of the swish statistics
hist(mcols(y)$stat, breaks=40, col="grey")
cols = rep(c("blue","purple","red"),each=2)
for (i in 1:6) {
  arrows(mcols(y)$stat[i], 20,
         mcols(y)$stat[i], 10,
         col=cols[i], length=.1, lwd=2)
}

# plot inferential replicates
plotInfReps(y, 1, "condition")
plotInfReps(y, 3, "condition")
plotInfReps(y, 5, "condition")

Package 'fishpond'

Help Index

Fishpond: downstream methods and tools for expression data

Description

Details

References

Read statistics and nulls from CSV file

Description

Usage

Arguments

Value

Construct a sparse matrix of transcript compatibility counts from alevin output

Description

Usage

Arguments

Value

Details

Author(s)

Compute inferential relative variance (InfRV)

Description

Usage

Arguments

Details

Value

References

deswish: DESeq2-apeglm With Inferential Samples Helps

Description

Usage

Arguments

Value

References

Examples

Obtain a trace of inferential replicates for a sample

Description

Usage

Arguments

Value

Examples

Import allelic counts as a SummarizedExperiment

Description

Usage

Arguments

Details

Value

References

Create isoform proportions from scaled data

Description

Usage

Arguments

Value

Label rows to keep based on minimal count

Description

Usage

Arguments

Value

Examples

Load in data from alevin-fry USA mode

Description

Usage

Arguments

Value

Details about loadFry

Details about the output format

Author(s)

References

Examples

Make pseudo-inferential replicates from mean and variance

Description

Usage

Arguments

Details

Value

References

Examples

Make simulated data for swish for examples/testing

Description

Usage

Arguments

Value

Examples

Details about `loadFry`