Title: | Fishpond: downstream methods and tools for expression data |
---|---|
Description: | Fishpond contains methods for differential transcript and gene expression analysis of RNA-seq data using inferential replicates for uncertainty of abundance quantification, as generated by Gibbs sampling or bootstrap sampling. Also the package contains a number of utilities for working with Salmon and Alevin quantification files. |
Authors: | Anqi Zhu [aut, ctb], Michael Love [aut, cre], Avi Srivastava [aut, ctb], Rob Patro [aut, ctb], Joseph Ibrahim [aut, ctb], Hirak Sarkar [ctb], Euphy Wu [ctb], Noor Pratap Singh [ctb], Scott Van Buren [ctb], Dongze He [ctb], Steve Lianoglou [ctb], Wes Wilson [ctb], Jeroen Gilis [ctb] |
Maintainer: | Michael Love <[email protected]> |
License: | GPL-2 |
Version: | 2.13.0 |
Built: | 2024-11-29 08:18:22 UTC |
Source: | https://github.com/bioc/fishpond |
This package provides statistical methods and other tools for working with Salmon and Alevin quantification of RNA-seq data. Fishpond contains the Swish non-parametric method for detecting differential transcript expression (DTE). Swish can also be used to detect differential gene expresion (DGE), to perform allelic analysis, or to assess changes in isoform proportions.
The main Swish functions are:
scaleInfReps
- scaling transcript or gene expression data
labelKeep
- labelling which features have sufficient counts
swish
- perform non-parametric differential analysis
Plots, e.g., plotMASwish
, plotInfReps
All software-related questions should be posted to the Bioconductor Support Site:
https://support.bioconductor.org
The code can be viewed at the GitHub repository, which also lists the contributor code of conduct:
https://github.com/mikelove/fishpond
Swish method:
Zhu, A., Srivastava, A., Ibrahim, J.G., Patro, R., Love, M.I. (2019) Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Research. https://doi.org/10.1093/nar/gkz622
Compression, makeInfReps and splitSwish:
Van Buren, S., Sarkar, H., Srivastava, A., Rashid, N.U., Patro, R., Love, M.I. (2020) Compression of quantification uncertainty for scRNA-seq counts. bioRxiv. https://doi.org/10.1101/2020.07.06.189639
After running splitSwish
and the associated
Snakefile
, this function can be used to gather and
add the results to the original object. See the alevin
section of the vignette for an example.
addStatsFromCSV(y = NULL, infile, estPi0 = FALSE)
addStatsFromCSV(y = NULL, infile, estPi0 = FALSE)
y |
a SummarizedExperiment (if NULL, function will output a data.frame) |
infile |
character, path to the |
estPi0 |
logical, see |
the SummarizedExperiment with metadata columns added,
or if y
is NULL, a data.frame of compiled results
Constructs a UMI count matrix with equivalence class identifiers in the rows and barcode identifiers in the columns. The count matrix is generated from one or multiple 'bfh.txt' files that have been created by running alevin-fry with the –dumpBFH flag. Alevin-fry - https://doi.org/10.1186/s13059-019-1670-y
alevinEC( paths, tx2gene, multigene = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, quiet = FALSE )
alevinEC( paths, tx2gene, multigene = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, quiet = FALSE )
paths |
'Charachter' or 'character vector', path specifying the location of the 'bfh.txt' files generated with alevin-fry. |
tx2gene |
A 'dataframe' linking transcript identifiers to their corresponding gene identifiers. Transcript identifiers must be in a column 'isoform_id'. Corresponding gene identifiers must be in a column 'gene_id'. |
multigene |
'Logical', should equivalence classes that are compatible with multiple genes be retained? Default is 'FALSE', removing such ambiguous equivalence classes. |
ignoreTxVersion |
logical, whether to split the isoform id on the '.' character to remove version information to facilitate matching with the isoform id in 'tx2gene' (default FALSE). |
ignoreAfterBar |
logical, whether to split the isoform id on the '|' character to facilitate matching with the isoform id in 'tx2gene' (default FALSE). |
quiet |
'Logical', set 'TRUE' to avoid displaying messages. |
A list with two elements. The first element 'counts' is a sparse count matrix with equivalence class identifiers in the rows and barcode identifiers followed by an underscore and a sample identifier in the columns. The second element 'tx2gene_matched' allows for linking the equivalence class identifiers to their respective transcripts and genes.
The resulting count matrix uses equivalence class identifiers as rownames. These can be linked to respective transcripts and genes using the 'tx2gene_matched' element of the output. Specifically, if the equivalence class identifier reads 1|2|8, then the equivalence class is compatible with the transcripts and their respective genes in rows 1, 2 and 8 of 'tx2gene_matched'.
Jeroen Gilis
InfRV
is a useful quantity for comparing groups of features
(transcripts, genes, etc.) by inferential uncertainty.
This function provides computation of the mean InfRV over samples,
per feature, stored in mcols(y)$meanInfRV
.
computeInfRV(y, pc = 5, shift = 0.01, meanVariance, useCounts = FALSE)
computeInfRV(y, pc = 5, shift = 0.01, meanVariance, useCounts = FALSE)
y |
a SummarizedExperiment |
pc |
a pseudocount parameter for the denominator |
shift |
a final shift parameter |
meanVariance |
logical, use pre-computed inferential mean
and variance assays instead of |
useCounts |
logical, whether to use the MLE count matrix for the mean instead of mean of inferential replicates. this argument is for backwards compatability, as previous versions used counts. Default is FALSE |
InfRV is defined in Zhu et al. (2019) as:
, using the inferential
sample variance and sample mean. This formulation takes the
non-Poisson part of the inferential variance and scales by the
mean, which effectively stabilizes inferential uncertainty over
mean count. In practice, we also add
pc
to the denominator and
shift
to the final quantity, to facilitate visualization.
This function also computes and adds the mean and variance of inferential
replicates, which can be useful ahead of plotInfReps
.
Note that InfRV is not used in the swish
statistical method (for generating test statistics, p-values
or q-values), it is just for visualization.
a SummarizedExperiment with meanInfRV
in the metadata columns
Anqi Zhu, Avi Srivastava, Joseph G Ibrahim, Rob Patro, Michael I Love "Nonparametric expression analysis using inferential replicate counts" Nucleic Acids Research (2019). https://doi.org/10.1093/nar/gkz622
The DESeq2-apeglm With Inferential Samples implementation supposes
a hierarchical distribution of log2 fold changes.
The final posterior standard deviation is calculated by
adding the posterior variance from modeling biological replicates
computed by apeglm
, and the observed variance on the posterior mode
over inferential replicates. This function requires the DESeq2 and
apeglm packages to be installed and will print an error if they are
not found.
deswish(y, x, coef)
deswish(y, x, coef)
y |
a SummarizedExperiment containing the inferential
replicate matrices, as output by |
x |
the design matrix |
coef |
the coefficient to test (see |
a SummarizedExperiment with metadata columns added: the log2 fold change and posterior SD using inferential replicates, and the original log2 fold change (apeglm) and its posterior SD
The DESeq
and lfcShrink
function in the DESeq2
package:
Zhu, Ibrahim, Love "Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences" Bioinformatics (2018).
Love, Huber, Anders "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2" Genome Biology (2014).
# a small example... 500 genes, 10 inf reps y <- makeSimSwishData(m=500, numReps=10) y <- labelKeep(y) #y <- deswish(y, ~condition, "condition_2_vs_1")
# a small example... 500 genes, 10 inf reps y <- makeSimSwishData(m=500, numReps=10) y <- labelKeep(y) #y <- deswish(y, ~condition, "condition_2_vs_1")
Simple helper function to obtain a trace (e.g. MCMC trace)
of the ordered inferential replicates for one samples.
Supports either multiple features, idx
, or multiple
samples, samp_idx
(not both). Returns a tidy
data.frame for easy plotting.
getTrace(y, idx, samp_idx)
getTrace(y, idx, samp_idx)
y |
a SummarizedExperiment with inferential replicates
as assays |
idx |
the names or row numbers of the gene or transcript to plot |
samp_idx |
the names or column numbers of the samples to plot |
a data.frame with the counts along the interential replicates, possible with additional columns specifying feature or sample
y <- makeSimSwishData() getTrace(y, "gene-1", "s1")
y <- makeSimSwishData() getTrace(y, "gene-1", "s1")
Read in Salmon quantification of allelic counts from a
diploid transcriptome. Assumes that diploid transcripts
are marked with the following suffix: an underscore and
a consistent symbol for each of the two alleles,
e.g. ENST123_M
and ENST123_P
,
or ENST123_alt
and ENST123_ref
, etc.
importAllelicCounts
requires the tximeta package.
Further information in Details below.
importAllelicCounts( coldata, a1, a2, format = c("wide", "assays"), tx2gene = NULL, ... )
importAllelicCounts( coldata, a1, a2, format = c("wide", "assays"), tx2gene = NULL, ... )
coldata |
a data.frame as used in |
a1 |
the symbol for the effect allele |
a2 |
the symbol for the non-effect allele |
format |
either |
tx2gene |
optional, a data.frame with first column indicating
transcripts, second column indicating genes (or any other transcript
grouping). Alternatively, this can be a GRanges object with
required columns |
... |
any arguments to pass to tximeta |
Requirements - There must be exactly two alleles for each
transcript, and the --keep-duplicates
option should be used
in Salmon indexing to avoid removal of transcripts with identical
sequence. The output object has half the number of transcripts,
with the two alleles either stored in a "wide"
object, or as
re-named "assays"
. Note carefully that the symbol provided
to a1
is used as the effect allele, and a2
is used as
the non-effect allele (see the format
argument description
and Value description below).
tx2gene - The two columns should include the a1
and
a2
suffix for the transcripts and genes/groups, or those
will be added internally, if it is detected that the first
transcript does not have these suffices. For example if
_alt
or _ref
, or _M
or _P
(as indicated
by the a1
and a2
arguments) are not present in the
table, the table rows will be duplicated with those suffices added
on behalf of the user. If tx2gene
is not provided, the
output object will be transcript-level. Do not attempt to set the
txOut
argument, it will conflict with internal calls to
downstream functions. If the a1/a2 suffices are not at the end of
the transcript name in the quantification files,
e.g. ENST123_M|<metadata>
, then ignoreAfterBar=TRUE
can be used to match regardless of the string following |
in
the quantification files.
skipMeta=TRUE
is used, as it is assumed the diploid
transcriptome does not match any reference transcript
collection. This may change in future iterations of the function,
depending on developments in upstream annotations and software.
If tx2gene
is a GRanges object, the rowRanges of the output
will be the reduced ranges of the grouped input ranges, with
tx_id
collapsed into a CharacterList, and TSS positions
saved as an IntegerList, if these are not equal among the transcripts
of a group.
Other metadata columns are not manipulated, just the metadata
for the first range is returned.
a SummarizedExperiment, with allele counts (and other data)
combined into a wide matrix [a2 | a1]
, or as assays (a1, then a2).
The original strings associated with a1 and a2 are stored in the
metadata of the object, in the alleles
list element.
Note the reference level of se$allele
will be "a2"
,
such that comparisons by default will be a1 vs a2
(effect vs non-effect).
Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) https://doi.org/10.1101/2022.08.12.503785
Takes output of scaled (and optionally filtered) counts
and returns isoform proportions by dividing out the
total scaled count for the gene for each sample.
The operation is performed on the counts
assay,
then creating a new assay called isoProp
,
and on all of the inferential replicates, turning them
from counts into isoform proportions. Any transcripts
(rows) from single isoform genes are removed, and the
transcripts will be re-ordered by gene ID.
isoformProportions(y, geneCol = "gene_id", quiet = FALSE)
isoformProportions(y, geneCol = "gene_id", quiet = FALSE)
y |
a SummarizedExperiment |
geneCol |
the name of the gene ID column in the
metadata columns for the rows of |
quiet |
display no messages |
a SummarizedExperiment, with single-isoform transcripts removed, and transcripts now ordered by gene
Adds a column keep
to mcols(y)
that specifies
which rows of the SummarizedExperiment will be included
in statistical testing. Rows are not removed, just marked
with the logical keep
.
labelKeep(y, minCount = 10, minN = 3, x)
labelKeep(y, minCount = 10, minN = 3, x)
y |
a SummarizedExperiment |
minCount |
the minimum count |
minN |
the minimum sample size at |
x |
the name of the condition variable, will
use the smaller of the two groups to set |
a SummarizedExperiment with a new column keep
in mcols(y)
y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y)
y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y)
Enables easy loading of sparse data matrices provided by alevin-fry USA mode.
loadFry(fryDir, outputFormat = "scRNA", nonzero = FALSE, quiet = FALSE)
loadFry(fryDir, outputFormat = "scRNA", nonzero = FALSE, quiet = FALSE)
fryDir |
path to the output directory returned by
alevin-fry quant command. This directory should contain a
|
outputFormat |
can be either be a list that defines the
desired format of the output |
nonzero |
whether to filter cells with non-zero expression
value across all genes (default |
quiet |
logical whether to display no messages |
A SingleCellExperiment
object that contains one
or more assays. Each assay consists of a gene by cell count matrix.
The row names are feature names, and the column names are cell
barcodes
loadFry
This function consumes the result folder returned by running
alevin-fry quant in unspliced, spliced, ambiguous (USA)
quantification mode, and returns a SingleCellExperiment
object
that contains a final count for each gene within each cell. In
USA mode, alevin-fry quant returns a count matrix contains three
types of count for each feature (gene) within each sample (cell
or nucleus), which represent the spliced mRNA count of the gene (S),
the unspliced mRNA count of the gene (U), and the count of UMIs whose
splicing status is ambiguous for the gene (A). For each assay
defined by outputFormat
, these three counts of a gene
within a cell will be summed to get the final count of the gene
according to the rule defined in the outputFormat
. The
returned object will contains the desired assays defined by
outputFormat
, with rownames as the barcode of samples and
colnames as the feature names.
The outputFormat
argument takes either be a list that defines
the desired format of the output
SingleCellExperiment
object or a string that represents one of
the pre-defined output format.
Currently the pre-defined formats
of the output SingleCellExperiment
object are:
This format is recommended for single cell experiments.
It returns a counts
assay that contains the S+A count of each gene in each cell,
and a unspliced
assay that contains the U count of each gene in each cell.
These three formats are the same.
They return a counts
assay that contains the U+S+A count of each gene in
each cell without any extra layers. "snRNA" is recommended for single-nucleus
RNA-sequencing experiments. "raw" is recommended for mimicing CellRanger 7's behavior,
which returns this format for both single-cell and single-nucleus experiments.
This format returns a counts
assay that contains the S+A
count of each gene in each cell.
This format puts the three kinds of counts into three separate assays,
which are unspliced
, spliced
and ambiguous
.
This format contains two assays.
The spliced
assay contains the S+A count of each gene in each cell.
The unspliced
assay contains the U counts of each gene in each cell.
This format is for direct entry into velociraptor R package or other scVelo downstream analysis pipeline for velocity analysis in R with Bioconductor. It adds the expected "S"-pliced assay and removes errors for size factors being non-positive.
A custom output format can be defined using a list. Each element in the list
defines an assay in the output SingleCellExperiment
object.
The name of an element in the list will be the name of the corresponding
assay in the output object. Each element in the list should be defined as
a vector that takes at least one of the three kinds of count, which are U, S and A.
See the provided toy example for defining a custom output format.
Dongze He, with contributions from Steve Lianoglou, Wes Wilson
alevin-fry publication:
He, D., Zakeri, M., Sarkar, H. et al. "Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data." Nature Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3
# Get path for minimal example avelin-fry output dir testdat <- fishpond:::readExampleFryData("fry-usa-basic") # This is exactly how the velocity format defined internally. custom_velocity_format <- list("spliced"=c("S","A"), "unspliced"=c("U")) # Load alevin-fry gene quantification in velocity format sce <- loadFry(fryDir=testdat$parent_dir, outputFormat=custom_velocity_format) SummarizedExperiment::assayNames(sce) # Load the same data but use pre-defined, velociraptor R pckage desired format scvelo_format <- "scVelo" scev <- loadFry(fryDir=testdat$parent_dir, outputFormat=scvelo_format, nonzero=TRUE) SummarizedExperiment::assayNames(scev)
# Get path for minimal example avelin-fry output dir testdat <- fishpond:::readExampleFryData("fry-usa-basic") # This is exactly how the velocity format defined internally. custom_velocity_format <- list("spliced"=c("S","A"), "unspliced"=c("U")) # Load alevin-fry gene quantification in velocity format sce <- loadFry(fryDir=testdat$parent_dir, outputFormat=custom_velocity_format) SummarizedExperiment::assayNames(sce) # Load the same data but use pre-defined, velociraptor R pckage desired format scvelo_format <- "scVelo" scev <- loadFry(fryDir=testdat$parent_dir, outputFormat=scvelo_format, nonzero=TRUE) SummarizedExperiment::assayNames(scev)
Makes pseudo-inferential replicate counts from
mean
and variance
assays. The simulated
counts are drawn from a negative binomial distribution,
with mu=mean
and size
set using a method
of moments estimator for dispersion.
makeInfReps(y, numReps, minDisp = 0.001)
makeInfReps(y, numReps, minDisp = 0.001)
y |
a SummarizedExperiment |
numReps |
how many inferential replicates |
minDisp |
the minimal dispersion value, set after method of moments estimation from inferential mean and variance |
Note that these simulated counts only reflect marginal
variance (one transcript or gene at a time),
and do not capture the covariance of counts across
transcripts or genes, unlike imported inferential
replicate data. Therefore, makeInfReps
should
not be used with summarizeToGene
to create
gene-level inferential replicates if inferential
replicates were originally created on the transcript
level. Instead, import the original inferential
replicates.
a SummarizedExperiment
Van Buren, S., Sarkar, H., Srivastava, A., Rashid, N.U., Patro, R., Love, M.I. (2020) Compression of quantification uncertainty for scRNA-seq counts. bioRxiv. https://doi.org/10.1101/2020.07.06.189639
library(SummarizedExperiment) mean <- matrix(1:4,ncol=2) variance <- mean se <- SummarizedExperiment(list(mean=mean, variance=variance)) se <- makeInfReps(se, numReps=50)
library(SummarizedExperiment) mean <- matrix(1:4,ncol=2) variance <- mean se <- SummarizedExperiment(list(mean=mean, variance=variance)) se <- makeInfReps(se, numReps=50)
Makes a small swish dataset for examples and testing.
The first six genes have some differential expression
evidence in the counts, with varying degree of inferential
variance across inferential replicates (1-2: minor,
3-4: some, 5-6: substantial). The 7th and 8th
genes have all zeros to demonstrate labelKeep
.
makeSimSwishData( m = 1000, n = 10, numReps = 20, null = FALSE, meanVariance = FALSE, allelic = FALSE, diffAI = FALSE, dynamicAI = FALSE )
makeSimSwishData( m = 1000, n = 10, numReps = 20, null = FALSE, meanVariance = FALSE, allelic = FALSE, diffAI = FALSE, dynamicAI = FALSE )
m |
number of genes |
n |
number of samples |
numReps |
how many inferential replicates to generate |
null |
logical, whether to make an all null dataset |
meanVariance |
logical, whether to output only mean and variance of inferential replicates |
allelic |
logical, whether to make an allelic sim dataset |
diffAI |
logical, whether to make a differential allelic sim dataset |
dynamicAI |
logical, whether to make a dynamic allelic sim dataset |
a SummarizedExperiment
library(SummarizedExperiment) y <- makeSimSwishData() assayNames(y)
library(SummarizedExperiment) y <- makeSimSwishData() assayNames(y)
This helper function takes either a TxDb/EnsDb or
GRanges object as input and outputs a GRanges object
where transcripts are aggregated to the gene + TSS
(transcription start site). For nearby TSS that should
be grouped together, see maxgap
.
makeTx2Tss(x, maxgap = 0)
makeTx2Tss(x, maxgap = 0)
x |
either TxDb/EnsDb or GRanges object. The GRanges
object should have metadata columns |
maxgap |
integer, number of basepairs to use determining whether to combine nearby TSS |
GRanges with columns tx_id
, tss
, and
group_id
## Not run: library(EnsDb.Hsapiens.v86) edb <- EnsDb.Hsapiens.v86 t2t <- makeTx2Tss(edb) ## End(Not run)
## Not run: library(EnsDb.Hsapiens.v86) edb <- EnsDb.Hsapiens.v86 t2t <- makeTx2Tss(edb) ## End(Not run)
This function is called by the Snakefile
that is generated
by splitSwish
. See alevin example in the vignette.
As such, it doesn't need to be run by users in an interactive
R session.
miniSwish( infile, outfile, numReps = 20, lengthCorrect = FALSE, overwrite = FALSE, ... )
miniSwish( infile, outfile, numReps = 20, lengthCorrect = FALSE, overwrite = FALSE, ... )
infile |
path to an RDS file of a SummarizedExperiment |
outfile |
a CSV file to write out |
numReps |
how many inferential replicates to generate |
lengthCorrect |
logical, see |
overwrite |
logical, whether |
... |
arguments passed to |
Note that the default for length correction is FALSE, as
opposed to the default in scaleInfReps
which
is TRUE. The default for numReps
here is 20.
nothing, files are written out
Plot allelic data (allelic proportions, isoform propostions)
in a gene context leveraging the Gviz package. See the allelic
vignette for example usage. TPM and count filters are used by
default to clean up the plot of features with minimal signal;
note that the isoform proportion displayed at the bottom of the
plot is among the features that pass the filtering steps.
If the function is not responding, it is likely due to issues
connecting to UCSC servers to draw the ideogram, in this case
set ideogram=FALSE
.
plotAllelicGene( y, gene, db, region = NULL, symbol = NULL, genome = NULL, tpmFilter = 1, isoPropFilter = 0.05, countFilter = 10, pc = 1, transcriptAnnotation = "symbol", labels = list(a2 = "a2", a1 = "a1"), qvalue = TRUE, log2FC = TRUE, ideogram = FALSE, cov = NULL, covFacetIsoform = FALSE, allelicCol = c("dodgerblue", "goldenrod1"), isoformCol = "firebrick", statCol = "black", gridCol = "grey80", baselineCol = "black", titleCol = "black", titleAxisCol = "black", titleBgCol = "white", geneBorderCol = "darkblue", geneFillCol = "darkblue", genomeAxisCol = "black", innerFontCol = "black", ... )
plotAllelicGene( y, gene, db, region = NULL, symbol = NULL, genome = NULL, tpmFilter = 1, isoPropFilter = 0.05, countFilter = 10, pc = 1, transcriptAnnotation = "symbol", labels = list(a2 = "a2", a1 = "a1"), qvalue = TRUE, log2FC = TRUE, ideogram = FALSE, cov = NULL, covFacetIsoform = FALSE, allelicCol = c("dodgerblue", "goldenrod1"), isoformCol = "firebrick", statCol = "black", gridCol = "grey80", baselineCol = "black", titleCol = "black", titleAxisCol = "black", titleBgCol = "white", geneBorderCol = "darkblue", geneFillCol = "darkblue", genomeAxisCol = "black", innerFontCol = "black", ... )
y |
a SummarizedExperiment (see |
gene |
the name of the gene of interest, requires
a column |
db |
either a TxDb or EnsDb object to use for the gene model |
region |
GRanges, the region to be displayed in the Gviz plot. if not specified, will be set according to the gene plus 20 of the total gene extent on either side |
symbol |
alternative to |
genome |
UCSC genome code (e.g. |
tpmFilter |
minimum TPM value (mean over samples) to keep a feature |
isoPropFilter |
minimum percent of isoform proportion to keep a feature |
countFilter |
minimum count value (mean over samples) to keep a feature |
pc |
pseudocount to avoid dividing by zero in allelic proportion calculation |
transcriptAnnotation |
argument passed to Gviz::GeneRegionTrack
( |
labels |
list, labels for a2 (non-effect) and a1 (effect) alleles |
qvalue |
logical, whether to inclue qvalue track |
log2FC |
logical, whether to include log2FC track |
ideogram |
logical, whether to include ideogram track |
cov |
character specifying a factor or integer variable to use
to facet the allelic proportion plots, should be a column in
|
covFacetIsoform |
logical, if |
allelicCol |
the colors of the points and lines for allelic proportion |
isoformCol |
the colors of the points and lines for isoform proportion |
statCol |
the color of the lollipops for q-value and log2FC |
gridCol |
the color of the grid in the data tracks |
baselineCol |
the color of the horizontal baseline for q-value and lo2gFC |
titleCol |
font color of the side titles (track labels) |
titleAxisCol |
axis color of the side titles (track labels) |
titleBgCol |
background color of the side titles (track labels) |
geneBorderCol |
the color of the borders and font in gene region track |
geneFillCol |
the color of the fill in the gene region track |
genomeAxisCol |
line color of the genome axis track |
innerFontCol |
font color of genome axis track, ideogram, and allelic proportion legend |
... |
additional arguments passed to |
nothing, a plot is displayed
The methods for allelic expression analysis are described in:
Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) https://doi.org/10.1101/2022.08.12.503785
This function makes use of the Gviz package that is described in:
Hahne, F., Ivanek, R. (2016). Visualizing Genomic Data Using Gviz and Bioconductor. In: Mathé, E., Davis, S. (eds) Statistical Genomics. Methods in Molecular Biology, vol 1418. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3578-9_16
Plot allelic ratio heatmap over features and samples using the pheatmap package. The a1/(a2 + a1) ratio is displayed.
plotAllelicHeatmap( y, idx, breaks = NULL, cluster_cols = FALSE, main = "Allelic ratio", stripAfterChar = "-", ... )
plotAllelicHeatmap( y, idx, breaks = NULL, cluster_cols = FALSE, main = "Allelic ratio", stripAfterChar = "-", ... )
y |
a SummarizedExperiment (see |
idx |
a numeric or logical vector of which features to plot |
breaks |
breaks passed along to pheatmap |
cluster_cols |
logical, passed to pheatmap |
main |
title of the plot |
stripAfterChar |
for the column names, if specified will strip allelic identifiers after this character, default is hyphen. set to NULL to avoid this action |
... |
other arguments passed to pheatmap |
nothing, a plot is displayed
The methods for allelic expression analysis are described in:
Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) https://doi.org/10.1101/2022.08.12.503785
This function makes use of the pheatmap package:
Kolde, Raivo. "Pheatmap: pretty heatmaps." R package version 1.2 (2012): 726.
For datasets with inferential replicates, boxplots are
drawn for the two groups and potentially grouped by
covariates. For datasets with only mean and variance,
points and intervals (95
approximation) are drawn. Additionally, for numeric
x
values, points and intervals will be drawn
and computeInfRV
should be run first
in order to add the mean and variance statistics.
plotInfReps( y, idx, x, cov = NULL, colsDrk = c("dodgerblue", "goldenrod4", "royalblue4", "red3", "purple4", "darkgreen"), colsLgt = c("lightblue1", "goldenrod1", "royalblue1", "salmon1", "orchid1", "limegreen"), xaxis, xlab, ylim, main, mainCol, legend = FALSE, legendPos = "topleft", legendTitle = FALSE, legendCex = 1, useMean = TRUE, q = qnorm(0.975), applySF = FALSE, reorder, thin, shiftX )
plotInfReps( y, idx, x, cov = NULL, colsDrk = c("dodgerblue", "goldenrod4", "royalblue4", "red3", "purple4", "darkgreen"), colsLgt = c("lightblue1", "goldenrod1", "royalblue1", "salmon1", "orchid1", "limegreen"), xaxis, xlab, ylim, main, mainCol, legend = FALSE, legendPos = "topleft", legendTitle = FALSE, legendCex = 1, useMean = TRUE, q = qnorm(0.975), applySF = FALSE, reorder, thin, shiftX )
y |
a SummarizedExperiment (see |
idx |
the name or row number of the gene or transcript |
x |
the name of the condition variable for splitting
and coloring the samples or cells. Also can be a numeric,
e.g. pseudotime, in which case, |
cov |
the name of the covariate for adjustment |
colsDrk |
dark colors for the lines of the boxes |
colsLgt |
light colors for the inside of the boxes |
xaxis |
logical, whether to label the sample numbers.
default is |
xlab |
the x-axis label |
ylim |
y limits |
main |
title |
mainCol |
name of metadata column to use for title (instead of rowname) |
legend |
logical, show simple legend (default FALSE) |
legendPos |
character, position of the legend (default "topleft") |
legendTitle |
logical, whether to add the name of the grouping variable as a title on the legend (default FALSE) |
legendCex |
numeric, size of the legend (default 1) |
useMean |
logical, when inferential replicates
are not present or when |
q |
numeric, the quantile to use when plotting
the intervals when inferential replicates are not
present or when |
applySF |
logical, when inferential replicates are
not present, should |
reorder |
logical, should points within a group defined by condition and covariate be re-ordered by their count value (default is FALSE, except for alevin data) |
thin |
integer, should the mean and interval lines be drawn thin (the default switches from 0 [not thin] to 1 [thinner] at n=150 cells, and from 1 [thinner] to 2 [thinnest] at n=400 cells) |
shiftX |
when |
nothing, a plot is displayed
y <- makeSimSwishData() plotInfReps(y, 3, "condition") y <- makeSimSwishData(n=40) y$batch <- factor(rep(c(1,2,3,1,2,3),c(5,10,5,5,10,5))) plotInfReps(y, 3, "condition", "batch")
y <- makeSimSwishData() plotInfReps(y, 3, "condition") y <- makeSimSwishData(n=40) y$batch <- factor(rep(c(1,2,3,1,2,3),c(5,10,5,5,10,5))) plotInfReps(y, 3, "condition", "batch")
MA plot - log fold change over average counts
plotMASwish(y, alpha = 0.05, sigcolor = "blue", ...)
plotMASwish(y, alpha = 0.05, sigcolor = "blue", ...)
y |
a SummarizedExperiment (see |
alpha |
the FDR threshold for coloring points |
sigcolor |
the color for the significant points |
... |
passed to plot |
nothing, a plot is displayed
y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y) y <- swish(y, x="condition") plotMASwish(y)
y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y) y <- swish(y, x="condition") plotMASwish(y)
Constructs a count matrix with equivalence class identifiers in the rows. The count matrix is generated from one or multiple 'eq_classes.txt' files that have been created by running salmon with the –dumpEq flag. Salmon - https://doi.org/10.1038/nmeth.4197
salmonEC( paths, tx2gene, multigene = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, quiet = FALSE )
salmonEC( paths, tx2gene, multigene = FALSE, ignoreTxVersion = FALSE, ignoreAfterBar = FALSE, quiet = FALSE )
paths |
'Charachter' or 'character vector', path specifying the location of the 'eq_classes.txt' files generated with salmon. |
tx2gene |
A 'dataframe' linking transcript identifiers to their corresponding gene identifiers. Transcript identifiers must be in a column 'isoform_id'. Corresponding gene identifiers must be in a column 'gene_id'. |
multigene |
'Logical', should equivalence classes that are compatible with multiple genes be retained? Default is 'FALSE', removing such ambiguous equivalence classes. |
ignoreTxVersion |
logical, whether to split the isoform id on the '.' character to remove version information to facilitate matching with the isoform id in 'tx2gene' (default FALSE). |
ignoreAfterBar |
logical, whether to split the isoform id on the '|' character to facilitate matching with the isoform id in 'tx2gene' (default FALSE). |
quiet |
'Logical', set 'TRUE' to avoid displaying messages. |
A list with two elements. The first element 'counts' is a sparse count matrix with equivalence class identifiers in the rows. If multiple paths are specified, the columns are in the same order as the paths. The second element 'tx2gene_matched' allows for linking those identifiers to their respective transcripts and genes.
The resulting count matrix uses equivalence class identifiers as rownames. These can be linked to respective transcripts and genes using the 'tx2gene_matched' element of the output. Specifically, if the equivalence class identifier reads 1|2|8, then the equivalence class is compatible with the transcripts and their respective genes in rows 1, 2 and 8 of 'tx2gene_matched'.
Jeroen Gilis
A helper function to scale the inferential replicates to the mean sequencing depth. The scaling takes into account a robust estimator of size factor (median ratio method is used). First, counts are corrected per row using the effective lengths (for gene counts, the average transcript lengths), then scaled per column to the geometric mean sequence depth, and finally are adjusted per-column up or down by the median ratio size factor to minimize systematic differences across samples.
scaleInfReps( y, lengthCorrect = TRUE, meanDepth = NULL, sfFun = NULL, minCount = 10, minN = 3, saveMeanScaled = FALSE, quiet = FALSE )
scaleInfReps( y, lengthCorrect = TRUE, meanDepth = NULL, sfFun = NULL, minCount = 10, minN = 3, saveMeanScaled = FALSE, quiet = FALSE )
y |
a SummarizedExperiment with: |
lengthCorrect |
whether to use effective length correction (default is TRUE) |
meanDepth |
(optional) user can specify a different mean sequencing depth. By default the geometric mean sequencing depth is computed |
sfFun |
(optional) size factors function. An alternative to the median ratio can be provided here to adjust the scaledTPM so as to remove remaining library size differences. Alternatively, one can provide a numeric vector of size factors |
minCount |
for internal filtering, the minimum count |
minN |
for internal filtering, the minimum sample size
at |
saveMeanScaled |
store the mean of scaled inferential replicates as an assay 'meanScaled' |
quiet |
display no messages |
a SummarizedExperiment with the inferential replicates
as scaledTPM with library size already corrected (no need for further
normalization). A column log10mean
is also added which is the
log10 of the mean of scaled counts across all samples and all inferential
replicates.
y <- makeSimSwishData() y <- scaleInfReps(y)
y <- makeSimSwishData() y <- scaleInfReps(y)
The splitSwish
function splits up the y
object
along genes and writes a Snakefile
that can be used with
Snakemake to distribute running swish
across genes.
This workflow is primarily designed for large single cell datasets,
and so the default is to not perform length correction
within the distributed jobs.
See the alevin section of the vignette for an example. See
the Snakemake documention for details on how to run and customize
a Snakefile
: https://snakemake.readthedocs.io
splitSwish(y, nsplits, prefix = "swish", snakefile = NULL, overwrite = FALSE)
splitSwish(y, nsplits, prefix = "swish", snakefile = NULL, overwrite = FALSE)
y |
a SummarizedExperiment |
nsplits |
integer, how many pieces to break |
prefix |
character, the path of the RDS files to write out,
e.g. |
snakefile |
character, the path of a Snakemake file, e.g.
|
overwrite |
logical, whether the |
nothing, files are written out
Compression and splitting across jobs:
Van Buren, S., Sarkar, H., Srivastava, A., Rashid, N.U., Patro, R., Love, M.I. (2020) Compression of quantification uncertainty for scRNA-seq counts. bioRxiv. https://doi.org/10.1101/2020.07.06.189639
Snakemake:
Koster, J., Rahmann, S. (2012) Snakemake - a scalable bioinformatics workflow engine. Bioinformatics. https://doi.org/10.1093/bioinformatics/bts480
The Swish method, or "SAMseq With Inferential Samples Helps".
Performs non-parametric inference on rows of y
for
various experimental designs. See References for details.
swish( y, x, cov = NULL, pair = NULL, interaction = FALSE, cor = c("none", "spearman", "pearson"), nperms = 100, estPi0 = FALSE, qvaluePkg = "qvalue", pc = 5, nRandomPairs = 30, fast = NULL, returnNulls = FALSE, quiet = FALSE )
swish( y, x, cov = NULL, pair = NULL, interaction = FALSE, cor = c("none", "spearman", "pearson"), nperms = 100, estPi0 = FALSE, qvaluePkg = "qvalue", pc = 5, nRandomPairs = 30, fast = NULL, returnNulls = FALSE, quiet = FALSE )
y |
a SummarizedExperiment containing the inferential replicate matrices of median-ratio-scaled TPM as assays 'infRep1', 'infRep2', etc. |
x |
the name of the condition variable. A factor with two levels for a two group analysis (possible to adjust for covariate or matched samples, see next two arguments). The log fold change is computed as non-reference level over reference level (see vignette: 'Note on factor levels') |
cov |
the name of the covariate for adjustment.
If provided a stratified Wilcoxon in performed.
Cannot be used with |
pair |
the name of the pair variable, which should be the
number of the pair. Can be an integer or factor.
If specified, a signed rank test is used to build the statistic
by default.
Note: For simple paired designs, see use of |
interaction |
logical, whether to perform a test of an interaction
between |
cor |
character, whether to compute correlation of |
nperms |
the number of permutations. if set above the possible number of permutations, the function will print a message that the value is set to the maximum number of permutations possible |
estPi0 |
logical, whether to estimate pi0 |
qvaluePkg |
character, which package to use for q-value estimation,
|
pc |
pseudocount for finite estimation of |
nRandomPairs |
the number of random pseudo-pairs (only used with
|
fast |
an integer (0 or 1), toggles different methods based on speed,
currently only relevant for simple paired analysis. For simple paired design,
|
returnNulls |
logical, only return the |
quiet |
display no messages |
interaction:
The interaction tests are different than the
other tests produced by swish
, in that they focus on a difference
in the log2 fold change across levels of x
when comparing
the two levels in cov
. If pair
is specified, this
will perform a Wilcoxon rank sum test on the two groups
of matched sample LFCs. If pair
is not included, multiple
random pairs of samples within the two groups are chosen,
and again a Wilcoxon rank sum test compared the LFCs across groups.
a SummarizedExperiment with metadata columns added:
the statistic (either a centered Wilcoxon Mann-Whitney
or a signed rank statistic, aggregated over inferential replicates),
a log2 fold change (the median over inferential replicates,
and averaged over pairs or groups (if groups, weighted by sample size),
the local FDR and q-value, as estimated by the samr
package.
The citation for swish
method is:
Anqi Zhu, Avi Srivastava, Joseph G Ibrahim, Rob Patro, Michael I Love "Nonparametric expression analysis using inferential replicate counts" Nucleic Acids Research (2019). https://doi.org/10.1093/nar/gkz622
The swish
method builds upon the SAMseq
method,
and extends it by incorporating inferential uncertainty, as well
as providing methods for additional experimental designs (see vignette).
For reference, the publication describing the SAMseq
method is:
Jun Li and Robert Tibshirani "Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data" Stat Methods Med Res (2013). https://doi.org/10.1177/0962280211428386
library(SummarizedExperiment) set.seed(1) y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y) y <- swish(y, x="condition") # histogram of the swish statistics hist(mcols(y)$stat, breaks=40, col="grey") cols = rep(c("blue","purple","red"),each=2) for (i in 1:6) { arrows(mcols(y)$stat[i], 20, mcols(y)$stat[i], 10, col=cols[i], length=.1, lwd=2) } # plot inferential replicates plotInfReps(y, 1, "condition") plotInfReps(y, 3, "condition") plotInfReps(y, 5, "condition")
library(SummarizedExperiment) set.seed(1) y <- makeSimSwishData() y <- scaleInfReps(y) y <- labelKeep(y) y <- swish(y, x="condition") # histogram of the swish statistics hist(mcols(y)$stat, breaks=40, col="grey") cols = rep(c("blue","purple","red"),each=2) for (i in 1:6) { arrows(mcols(y)$stat[i], 20, mcols(y)$stat[i], 10, col=cols[i], length=.1, lwd=2) } # plot inferential replicates plotInfReps(y, 1, "condition") plotInfReps(y, 3, "condition") plotInfReps(y, 5, "condition")