This provides an overview of the R package ALDEx version 2 (ALDEx2) for differential (relative) abundance analysis of high throughput sequencing count compositional data1. However, Nixon et al.2 have found that the scale of the data can be modelled and ALDEx2 now includes scale modelling by default.
The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq)3, but testing showed that it performed very well with traditional RNA-Seq datasets4, 16S rRNA gene variable region sequencing5 and selective growth-type (SELEX) experiments6, and that the underlying principles generalize to single-cell sequencing7. The analysis method should be applicable to any type of data that is generated by high-throughput8.
Most recently, Nixon et al.9 showed that the use of the CLR10 made strong assumptions about the scale of the underlying system. ALDEx2 now includes methods to include scale uncertainty and so incorporate both compositional and scale uncertainty together in the analysis. See the scaleSim vignette for more information and a complete walk-through.
There are two ways to download and install the most current of ALDEx2.
The most recent version of the package will be found at github.com/ggloor/ALDEx_bioc.
The package will run with only the base R packages, the standard
tidyverse packages and a minimal Bioconductor install, and is capable of
running several functions with the parallel
package if
installed. It is recommended that the package be run on the most
up-to-date R and Bioconductor versions.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ALDEx2")
Here all dependencies should be installed automatically.
In this case, you may need to install additional packages manually.
## |------------(25%)----------(50%)----------(75%)----------|
ALDEx2 is a different tool than others used for differential abundance testing, but that does not mean it is complicated. Instead, it is incredibly simple once you understand four concepts.
First, the data you collect is a single point-estimate of the data. That means, if you sequenced the same samples, or even the same library, you would get a slightly different answer. We know this because there are a number of early technical replication papers that show exactly this11.
Second, that the data collected are probabilistic; that the count data is a representation of the probability of sampling the gene fragment from the mixture of gene fragments in the library. Many other tools assume the data are counts from a multivariate distribution modelled by a Negative Binomial distribution.
Third, that we can pretty accurately estimate this probabilisitic technical variation. For this, we assume the data collected are from a multivariate Poisson distribution constrained to have the total number of counts that the instrument can deliver12. This can be estimated by sampling from the appropriate distribution, in this case a multinomial Dirichlet distribution.
Fourth, that the use of the CLR (or any other normalization) makes strong assumptions about the scale of the sampled environment. Nixon et al.13 demonstrated how to incorporate scale uncertainty so that the effect of scale on the analysis can be modelled and accounted for.
More formally, the ALDEx2 package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns a posterior probability distribution of the observed data under a repeated sampling model. This distribution is converted to a log-ratio that linearizes the differences between features. This log-ratio can be modified to include uncertainty regarding the scale of the data when the log-ratio is calculated. Thus, the values returned by ALDEx2 are posterior estimates of test statistics calculated on log-ratio transformed distributions. ALDEx2 has been altered to ensure that the elements of the posterior statistical tests agree on their sign with the outcome that some edge cases with low predicted p-values are no longer counted as significant at high false discovery rates. Combining two-sample tests into a posterior estimate is somewhat intricate14, but in practice the outcomes should be very similar to those seen before, but edge cases with expected false discovery rates greater than 0.05 have become more rigorous.
ALDEx2 uses by default the centred log-ratio (clr) transformation which is scale invariant15. While this is a convenient assumption for many cases, this can lead to Type 1 and Type 2 errors16 and we now suggest that modest amounts of scale be built into the analysis by default.
The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-seq studies or rare OTU species from 16S rRNA gene amplicon studies). In the absence of scale, all feature abundance values are expressed relative to the geometric mean abundance of other features in a sample. This is conceptually similar to a quantitative PCR where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description. If the question at hand is relative change, then ALDEx2 can do this. If the question at hand is scale dependent, then ALDEx2 can do that too!
aldex
with 2 groups:The ALDEx2 package in Bioconductor is suitable for the comparison of many different experimental designs.
The test dataset is a single gene library with 1600 sequence
variants17 cloned into an expression vector at
approximately equimolar amounts. The wild-type version of the gene
conferred resistance to a topoisomerase toxin. Seven independent growths
of the gene library were conducted under selective and non-selective
conditions and the resulting abundances of each variant was read out by
sequencing a pooled, barcoded library on an Illumina MiSeq. The data
table is accessed by data(selex)
. In this data table, there
are 1600 features and 14 samples. The analysis takes approximately 2
minutes and memory usage tops out at less than 1Gb of RAM on a mobile i7
class processor when we use 128 Dirichlet Monte-Carlo Instances (DMC).
On an M2 Pro chip, this takes approximately 6 seconds. For speed
concerns we use only the last 400 features and perform only 16 DMC. The
command used for ALDEx2 is
presented below:
library(ALDEx2)
#subset only the last 400 features for efficiency
data(selex)
selex.sub <- selex[1200:1600,]
conds <- c(rep("NS", 7), rep("S", 7))
x.aldex <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE,
include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE, gamma=NULL)
# note default is FDR of 0.05
par(mfrow=c(1,3))
aldex.plot(x.aldex, type="MA", test="welch", xlab="Log-ratio abundance",
ylab="Difference", main='Bland-Altman plot')
aldex.plot(x.aldex, type="MW", test="welch", xlab="Dispersion",
ylab="Difference", main='Effect plot')
aldex.plot(x.aldex, type="volcano", test="welch", xlab="Difference",
ylab="-1(log10(q))", main='Volcano plot')
Figure @ref(fig:three-plots) shows the result of a two sample t-test
and effect size calculation using aldex()
. We set the test
to ‘t’, and effect to TRUE
. The ‘t’ option evaluates the
data as a two-factor experiment using both the Welch’s t and the
Wilcoxon rank tests. The t-test includes an expected value of the false
discovery rate correction. The data can be plotted onto Bland-Altmann18 (MA) or
effect (MW)19 or volcano20 plots for for
two-way tests using the aldex.plot()
function. See the end
of the modular section for examples of the plots.
The simple approach outlined above just calls
aldex.clr, aldex.ttest, aldex.effect
in turn and then
merges the data into one dataframe called x.all
. We will
show these modules in turn, and then examine additional modules.
There is a lot more information that can be obtained than with these three plots. One of the most powerful aspect of ALDEx2 is that everything is calculated on posterior distributions. The underlying posterior distribution can be viewed if you use the individual modules explained next.
For many users, the aldex
function may be sufficient.
However, for power users there is a lot of value to the modular
approach. There are several built-in tools that allow a much more
powerful exploration of the data. In addition, all the intermediate
values are exposed for the underlying entre log-ratio transformed
Dirichlet Monte-Carlo replicate values. This makes it possible for
anyone to add the specific R code for their experimental design — a
guide to these values is outlined below.
We will use the same example and generate the same plots but include
more information that can be useful. First we load the library and the
included selex dataset. Then we set the comparison groups. This must be
a vector of conditions in the same order as the samples in the input
counts table. The aldex
command is calling several other
functions in the background, and each of them returns diagnostics. These
diagnostics are suppressed for this vignette. The code block below shows
the full set of command, but are not run. Each is run in an individual
code block
data(selex)
#subset only the last 400 features for efficiency
selex.sub <- selex[1200:1600,]
conds <- c(rep("NS", 7), rep("S", 7))
x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F, gamma=1e-3)
x.tt <- aldex.ttest(x, hist.plot=F, paired.test=FALSE, verbose=FALSE,)
x.effect <- aldex.effect(x, CI=T, verbose=F, include.sample.summary=F,
paired.test=FALSE, glm.conds=NULL, useMC=F)
x.all <- data.frame(x.tt,x.effect)
par(mfrow=c(1,2))
aldex.plot(x.all, type="MA", test="welch")
aldex.plot(x.all, type="MW", test="welch")
aldex.clr
moduleThe workflow for the modular approach first generates random
instances of the centred log-ratio transformed values. There are three
inputs: counts table, a vector of conditions, and the number of
Monte-Carlo (DMC) instances; and two main parameters: the level of
verbosity (TRUE or FALSE) and the scale parameter. We recommend 128 or
more mc.samples for the t-test, 1000 for a rigorous effect size
calculation, and at least 16 for ANOVA.21 Additionally, while
in this example, gamma is set to 1e-3 (i.e., almost nothing) we
recommend including gamma in the range of 0.25 to 1 for most datasets.
See below, and the the scale_sim
vignette for more
detail.
This operation is fast.
The output in x
contains a lot of information, and you
should see the ALDEx2 R documentation for a complete explanation. The
impatient can see the commands to access all the available data with
methods(class='aldex.clr')
and the individual methods can
be accessed via help(getFeatures)
substituting getFeatures
for the appropriate method name.
aldex.ttest
moduleThe next operation performs the Welch’s t and Wilcoxon rank test for
the situation when there are only two conditions. There is only one
input: the aldex object from aldex.clr
. Several parameters
modify this test: whether or not to perform a paired test
paired-test
; whether or not to display the histogram of
p-values for the first MC instance hist.plot
; whether or
not to give verbose progress verbose
. The significance
tests are posterior p-values and are calculated as one-tailed tests with
the test statistic doubled to approximate the two-tailed test22. This
gets rid of a small number of false positive p-values that arose because
of the sign of the test statistic being unstable in the underlying
posterior distribution.
This operation is reasonably fast thanks to Thom Quinn and Michelle Nixon.
aldex.effect
moduleWe estimate and report effect sizes based on the within and between
condition values in the case of two conditions. This step is required
for plotting; in our lab we base our conclusions primarily on the output
of this function23. There is one input: the aldex object
from aldex.clr; and several useful parameters. It is possible to include
the 95% confidence interval information for the effect size estimate
with the boolean flag CI=TRUE
. This can be helpful when
deciding whether to include or exclude specific features from
consideration. We find that a large effect that is driven by an outlier
can be because of false positives. New is the option to calculate effect
sizes for paired t-tests or repeated samples with the boolean
paired.test=TRUE
option. In this case the difference
between and difference within values are calculated as the median paired
difference and the median absolute deviation of that difference. The
standardized effect size is their ratio and is usually slightly larger
than the unpaired effect size. The supplied selex dataset is actually a
paired dataset and can be used to explore this option. Other parameters
are include.sample.summary
which gives the expected clr
values for each feature; useMC
for multicore-testing shows
this is usually slower than single core; glm.conds
when
calculating effects for the glm contrasts; and finally
verbose
if you want to more closely follow progress.
The function has been heavily optimized for speed and includes a lot of vectorization under the hood.
aldex.plot
moduleWe merge the data into one object for plotting. We see that the plotted data in Figure 1 and 2 are essentially the same.
Note that plotting for the aldex.glm
function is
different and is shown later.
# merge into one output for convenience
x.all <- data.frame(x.tt,x.effect)
par(mfrow=c(1,3))
aldex.plot(x.all, type="MA", test="welch", main='MA plot')
aldex.plot(x.all, type="MW", test="welch", main='effect plot')
aldex.plot(x.all, type="volcano", test="welch", main='volcano plot')
The aldex.plot()
function gives you a lot of control.
Figure @ref(fig:three2-plots) shows the three plot types; effect
(type=‘MW’), MA or Bland-Altmann (type=‘MA’) and volcano
(type=‘volcano). They are nearly identical to those in
@ref(fig:three-plots), differences are due to the random Monte-Carlo
sampling. You can choose to use statistical tests or only effect sizes
(type=’welch’ or ‘effect’). You can control the title, axis labels and
scales as well. See ?aldex.plot
for all this can do!
ALDEx2
also allows you to plot and explore the underlying posterior
distributions for each part. This is with the
aldex.plotFeature()
function and the output is shown in
Figure @ref(fig:plot-feature). This is a great way to get intuition
around the underlying distributions. We see that the posterior
distributions are skewed, that there is considerable uncertainty in the
difference and effect sizes, and that the tails are very long.
# merge into one output for convenience
# we start with the ouput from aldex.clr
# we provide the name of the row we wish to explore
aldex.plotFeature(x, 'S:E:G:D')
As noted above, the ALDEx2 package generates a posterior distribution of the probability of observing the count given the data collected. Here we show the importance of this approach by examining the 95% CI of the effect size. Throughout, we use a standardized effect size, similar to the Cohen’s d metric, although our effect size is more robust and more conservative (being approximately 0.7 Cohen’s d when the data are Normally distributed)24.
Examining the outputs of the effect, MA and the posterior distribution plots shows a very important point: there is a tremendous amount of latent variation in sequencing data . We observe that there are a few features that have an expected q value that is statistically significantly different, which are both relatively rare and which have a relatively small difference. Even when identifying features based on the expected effect size can be misleading. We find that the safest approach is to identify those features that where the 95% CI of the effect size does not cross 0.
Examining the figures we see that the features that are the rarest are least likely to reproduce the effect size with simple random sampling. The behaviour of the 95% CI metric is exactly in line with our intuition: the precision of estimation of rare features is poor—if you want more confidence in rare features you must spend more money to sequence more deeply. Figure @ref(fig:CI) shows that using both an effect size (or if you wish a p-value cutoff) along with the effect CI cutoff can remove some features that are not truly separable between groups.
With this approach we are accepting the biological variation in the data as received; i.e. we are not inferring any additional biological variation: i.e., the experimental design is always as given. We are identifying those features where simple random sampling of the library would be expected to give the same result every time. This is the approach that was used in (Macklaim et al. 2013), the results of which were independently validated and found to be very robust (Nelson et al. 2015).
This approach requires that the CI=T parameter be set when calculating effect sizes.
The aldex.glm module is included so that the probabilistic compositional approach can be used for complex study designs. This module is substantially slower than the two-comparison tests above, but we think it is worth it if you have complex study designs.
Essentially, the approach is the modular approach above but using a model matrix and covariates supplied to the glm function in R. The values returned are the expected values of the glm function given the inputs. In the example below, we are measuring the predictive value of variables A and B independently. See the documentation for the R formula function, or http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf for more information.
Validation of features that are differential under any of the variables identified by the aldex.glm function should be performed by plotting and for this the aldex.glm.effect function as a post-hoc test.
Note that aldex.clr
will accept the
denom="all"
or a user-defined vector of denominator offsets
when a model matrix is supplied. Therefore, when it is intended that the
downstream analysis is the aldex.glm
function only those
two denominator options are available. This will be addressed in a
future release, and will likely be deprecated because the
aldex.clr
and all downstream functions are fully compatible
with the scale modelling which is both more efficient and more easily
understood.
data(selex)
selex.sub <- selex[1200:1600, ]
covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE),
"B" = c(rep(0, 7), rep(1, 7)),
"Z" = sample(c(1,2,3), 14, replace=TRUE))
mm <- model.matrix(~ A + Z + B, covariates)
x.glm <- aldex.clr(selex.sub, mm, mc.samples=4, denom="all", verbose=T)
glm.test <- aldex.glm(x.glm, mm, fdr.method='holm')
glm.eff<- aldex.glm.effect(x.glm)
The aldex.glm.effect
function will calculate the effect
size for all models in the matrix that are binary. The effect size
calculations for each binary predictor are output to a named list. These
effect sizes, and outputs from aldex.glm
can be plotted and
displayed as in Figure @ref(fig:glm) using the example code below. The
aldex.glm.plot
function, which mirrors the
adlex.plot
function for glm and glm.effect outputs.
# NEW plot the glm.test and glm.eff data for particular contrasts
aldex.glm.plot(glm.test, eff=glm.eff, contrast='B', type='MW', test='fdr')
aldex.kw
moduleAlternatively, the user can perform the Kruskal Wallace tests for
one-way ANOVA of two or more conditions. Here there are only two inputs:
the aldex object from aldex.clr, and the vector of conditions. Note that
this is s.l.o.w! and the aldex.glm()
function should be
used instead. The user is on their own for plotting as this function
should be viewed as unsupported.
While the actual size or scale of the underlying data is lost during
the process of sequencing, we can determine if any of the underlying
results are dependent on scale. This theory behind this new
functionality is outlined in Nixon et al.25 and McGovern et
al.26.
Essentially, we can think of the aldex.clr
function being
able to add both measurement error through Dirichlet Monte-Carlos
sampling of the actual measurements and scale uncertainty through random
sampling of the geometric mean of those measurements. Indeed, as shown
below, we can infer the actual scale difference between groups. Scale
uncertainty increase the robustness of the results, and can be used to
properly centre the data. We recommend using scale instead of using the
denominator adjustment approach given in Wu et al.27. The example below
is only a small sample of the scale functionality, see the
scale_sim
vignette for full description.
We will demonstrate the ALDEx2 scale function in two ways; with
essentially no scale (gamma=1e-3
) and by making a full
scale model. If necessary scale can be ignored completely by setting
gamma=NULL
, in which case previous default behavior is
replicated exactly. The minimally scaled version is simply the outputs
we have been examining until now of the in vitro selection dataset. Here
we have a standard of truth, and a theoretical grounding that the
difference in scale is real. We have a situation where only a minority
of the parts change in both relative and absolute abundance. Prior
analyses(McMurrough et al. 2014) suffered
from a false positive bias because of this, and were not amenable to
analysis with any other tool.
set.seed(11)
data(selex)
selex.sub <- selex[1201:1600,]
conds <- c(rep("NS", 7), rep("S", 7))
# unscaled
x.u <- aldex.clr(selex.sub, conds, mc.samples=16,
gamma=1e-3, verbose=FALSE)
x.u.e <- aldex.effect(x.u, CI=T)
# scaled
# this represents and 8-fold difference in scale between groups
mu.mod <- c(rep(1,7), rep(8,7))
scale.model <- aldex.makeScaleMatrix(gamma=0.5, mu=mu.mod, conditions=conds, mc.samples=16, log=FALSE)
x.ss <- aldex.clr(selex.sub, conds, mc.samples=16,
gamma=scale.model, verbose=FALSE)
x.ss.e <- aldex.effect(x.ss, CI=T)
x.ss.tt <- aldex.ttest(x.ss)
x.ss.all <- cbind(x.ss.e, x.ss.tt)
First, an explanation of the scale model that was made. The
aldex.makeScaleMatrix()
function makes a full scale model
of the data that depends on the ratios between the scales of the two
groups. In the case of the SELEX dataset, we are explicitly modelling an
8-fold difference in scale between the two datasets.
Now let’s look at the unscaled posterior distributions in Figure @ref(fig:scale-pre),
and compare this to Figure @ref(fig:post-scale). The unscaled posteriors are very different, mainly because the control posterior is very narrow. The scaled posterior distributions also very different but are both more dispersed because of the additional uncertainty introduced by the scale model (note the difference in x-axis scales). In addition, the clr values are different because we are supplying the denominator values directly (log2(1) and log2(8) rather than calculating the geometric means of the samples. We stress here that the actual values of the denominator used is irrelevant, instead it is the ratio between the denominators of the two groups that is key.
We can see the effect of scale if we plot the
aldex.effect
outputs from scaled and unscaled data. We see
that the difference within groups is larger, the effect size is smaller
but the difference between groups is unchanged when scale is added as
shown in Figure @ref(fig:comp-scale).
par(mfrow=c(1,3))
plot(x.u.e$diff.btw, x.ss.e$diff.btw, main='diff.btw',
xlab='unscaled diff', ylab='scaled diff', pch=19, col='red')
abline(0,1)
plot(x.u.e$diff.win, x.ss.e$diff.win, main='diff.win',
xlab='unscaled dispersion', ylab='scaled dispersion', pch=19, col='red')
abline(0,1)
plot(x.u.e$effect, x.ss.e$effect, main='effect size',
xlab='unscaled effect', ylab='scaled effect', pch=19, col='red')
abline(0,1)
We can also examine the effect plots, showing those parts where the 95% CI of the effect does not overlap 0 for the unscaled and scaled outputs as an example as in Figure @ref(fig:scale-CI). We see that adding scale uncertainty to the posterior model results in an increase in the dispersion of the data, but changes the difference between measurement very little.
In this dataset we have a standard of truth in an external growth assay for several of the parts displayed here. The scaled data is more similar to what we expect and the S:E:G:D part is the only one displayed on this plot that has appreciable growth in a single growth assay(McMurrough et al. 2014). Thus, adding scale uncertainty removes a large number of false positive identifications, without affecting the identification of the one true positive. Indeed, the true positive stands out more with scale than with no scale uncertainty added.
par(mfrow=c(1,3))
sgn.u <- sign(x.u.e$effect.low) == sign(x.u.e$effect.high)
sgn.s <- sign(x.ss.e$effect.low) == sign(x.ss.e$effect.high)
plot(x.u.e$diff.win, x.u.e$diff.btw, pch=19, cex=0.3,
col="grey",xlab="Dispersion", ylab="Difference",
main="unscaled-CI", xlim=c(0.2,5), ylim=c(-2.5,10))
points(x.u.e$diff.win[abs(x.u.e$effect) >=2], x.u.e$diff.btw[abs(x.u.e$effect) >=2], pch=19, cex=0.5,
col="red")
points(x.u.e$diff.win[sgn.u], x.u.e$diff.btw[sgn.u],
cex=0.7, col="blue")
text(x.u.e['S:E:G:D', 'diff.win'],
x.u.e['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D')
abline(0,2, lty=2, col="grey")
abline(0,-2, lty=2, col="grey")
plot(x.ss.e$diff.win, x.ss.e$diff.btw, pch=19, cex=0.3, col="grey",xlab="Dispersion", ylab="Difference", main="scaled-CI",
xlim=c(0.2,5), ylim=c(-2.5,10))
points(x.ss.e$diff.win[abs(x.ss.e$effect) >=2],
x.ss.e$diff.btw[abs(x.ss.e$effect) >=2], pch=19, cex=0.5, col="red")
points(x.ss.e$diff.win[sgn.u], x.ss.e$diff.btw[sgn.u], cex=0.7,
col="blue")
text(x.ss.e['S:E:G:D', 'diff.win'],
x.ss.e['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D')
abline(0,2, lty=2, col="grey")
abline(0,-2, lty=2, col="grey")
aldex.plot(x.ss.all, test='welch', main="scaled-BH", xlim=c(0.2,6))
abline(0,2, lty=2, col="grey")
abline(0,-2, lty=2, col="grey")
text(x.ss.all['S:E:G:D', 'diff.win'],
x.ss.all['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D')
ALDEx2
returns expected values for summary statistics. It is important to note
that ALDEx uses Bayesian sampling from a Dirichlet distribution to
estimate the underlying technical variation. This is controlled by the
number of mc.samples
, in practice we find that setting this
to 16 or 128 is sufficient for most cases as ALDEx2 is
estimating the expected value of the distributions28.
In practical terms, ALDEx2
takes the biological observations as given, but infers technical
variation (sequencing the same sample again) multiple times using the
aldex.clr
function. Thus, the expected values returned are
those that would likely have been observed . The user is cautioned that
the number of features called as differential will vary somewhat between
runs because of this sampling procedure. However, only features with
values close to the chosen significance cutoff will vary between runs.
The addition of scale allows the model to include potential biological
variation that is lost during the sequencing library and data
acquisition process.
Several papers have suggested that ALDEx2 is unable to properly control for the false discovery rate since the p-values returned do not follow a random uniform distribution, but rather tend to cluster near a value of 0.529. This is expected behaviour.
These studies also show that point estimate approaches are very sensitive to particular experimental designs and differences in sparsity and read depth. However, ALDEx2 is not sensitive to these characteristics of the data, but seem to under-report the true FDR. The criticisms misses the mark on ALDEx2 because ALDEx2 reports the p-value across the Dirichlet Monte-Carlo replicates. Features that are differential simply because of the vagaries of random sampling will indeed have a random uniform p-value as point estimates, but will have an expected p-value after repeated random sampling of 0.5. In contrast, features that are differential because of true biological variation are robust to repeated random sampling. Thus, ALDEx2 identifies as differential only those features where simple random sampling (the minimal NULL hypothesis) cannot explain the difference.
In our experience, we observe that ALDEx2 returns a set of features that is very similar to the set returned as the intersect of multiple independent tools—a common recommendation when examining HTS datasets30
aldex.effect(x)
(rowname) | rab.all | rab.win.NS | rab.win.S | diff.btw | diff.win | effect | overlap |
---|---|---|---|---|---|---|---|
A:D:A:D | 1.42494 | 1.30886 | 2.45384 | 1.12261 | 1.72910 | 0.471043 | 0.267260701 |
A:D:A:E | 1.71230 | 1.49767 | 4.23315 | 2.73090 | 2.38134 | 1.034873 | 0.135857781 |
A:E:A:D | 3.97484 | 1.41163 | 11.02154 | 9.64287 | 2.85008 | 3.429068 | 0.000156632 |
aldex.ttest(x)
(rowname) | we.ep | we.eBH | wi.ep | wi.eBH |
---|---|---|---|---|
A:D:A:D | 4.030e-01 | 0.63080 | 0.239383 | 0.43732 |
A:D:A:E | 1.154e-01 | 0.34744 | 0.040901 | 0.15725 |
A:E:A:D | 8.987e-05 | 0.00329 | 0.000582 | 0.00820 |
In the list below, the aldex.ttest
function returns the
values highlighted with * and the
aldex.effect
function returns the values highlighted with
♢. Note that when scale is applied that
the clr is only used with the default scale model, and the log-ratio for
the full scale model is calculated with a user-defined denominator
instead of the geometric mean of each sample. In practice, the ratio
between the denominators is more important than the actual way the
denominator is calculated.
include.item.summary=TRUE
aldex.glm(x)
The output from the aldex.glm
function is somewhat
different, although it still returns the expected values of the test
statistics.
For example, using the selex
dataset and the model
matrix from the aldex.glm
help information:
conds <- data.frame("A" = sample(0:1, 14, replace = TRUE),
"B" = c(rep(0, 7), rep(1, 7)))
mm <- model.matrix(~ A + B, conds)
x <- aldex.clr(selex, mm, mc.samples=4)
glm.test <- aldex.glm(x)
glm.eff <- aldex.glm.effect(x)
In this example the column names from
head(glm.test)
:
Note that we have generated both a random model
A'' and the correct model
B’’. We anticipate that model A
will give no statistically significant outputs, and that B will give
similar results to a t-test, and indeed this is the case.
The outputs are expected values of the intercept and its
coefficients, and the estimates for each model and their coefficients.
Also calculated is the Family Wide Error Rate (FWER) for each p-value
(by default reported in the model.x Pr(>|t|).holm
output. This is a Holm-Bonferroni FWER correction that is more powerful
than the Bonferroni correction, and is a step-down correction that is
more stringent than a false discovery rate correction such as
Benjamini-Hochberg. Note that for a generalized linear model, Tukey’s
HSD is not applicable to control the FWER. This is because Tukey’s HSD
depends on an ANOVA in the background in R and packages that calculate
Tukey’s HSD essentially all do an aov
, thus the FWER is
calculated on a different test statistic than is calculated by the
glm.
The expected value of the Holm-Bonferroni test can be used as the post-hoc standin for Tukey’s HSD. Note, that this is true only when there are a relatively small number of contrasts (up to 10) in the model31. For convenience the user can also select the Benjamini-Hochberg correction.
The effect size metric used by ALDEx2 is a standardized distributional effect size metric developed specifically for this package. The measure is somewhat robust, allowing up to 20% of the samples to be outliers before the value is affected, returns an effect size that is 71% the size of Cohen’s d on a Normal distribution, and requires at worst twice the number of samples as fully parametric methods (which are not robust) would to estimate values with the same precision. The metric is equally valid for Normal, random uniform and Cauchy distributions^(Andrew D. Fernandes et al. (2019) submitted).
We prefer to use the effect size whenever possible rather than statistical significance since an effect size tells the scientist what they want to know—“what is reproducibly different between groups”; this is not something that p-values deliver. We find that using the effect size returns a consistent set of true positive features irregardless of sample size, unlike p-value based methods. Furthermore, over half of the the false positive features that are observed at low sample sizes have and effect size > 0.5 × E the chosen effect size cutoff E. This is true regardless of the source of the dataset (Andrew D. Fernandes et al. (2019) submitted).
We suggest that an effect size cutoff of 1 or greater be used when analyzing HTS datasets. If preferred the user can also set a fold-change cutoff as is commonly done with p-value based methods, and volcano plots are now included as an option to help guide the user.
The built-in aldex.plot function described above will usually be sufficient, but for more user control the example in Figure 10 shows a plot that shows which features are found by the Welchs’ or Wilcoxon test individually (blue) or by both (red).
# identify which values are significant in both the t-test and glm tests
found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05)
# identify which values are significant in fewer than all tests
found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05)
# plot the within and between variation of the data
plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=0.3, col=rgb(0,0,0,0.3),
xlab="Dispersion", ylab="Difference")
points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19,
cex=0.7, col=rgb(0,0,1,0.5))
points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19,
cex=0.7, col=rgb(1,0,0,1))
abline(0,1,lty=2)
abline(0,-1,lty=2)
In some cases we observe that the data can be asymmetric. This occurs when the data are extremely asymmetric, such as when one group is largely composed of features that are absent in the other group or when the direction of change in the experiment is not symmetric. In this case the geometric mean will not accurately represent the appropriate basis of comparison for each group. An asymmetry can arise for many reasons: in RNA-seq it could arise because samples in one group contain a plasmid and the samples in the other group do not; in metagenomics or 16S rRNA gene sequencing it can arise when the samples in the two groups are taken from different environments; in a selex experiment it can arise because the two groups are under different selective constraints. The asymmetry can manifest either as a difference in sparsity (i.e., one group contains more 0 value features than the other) or as a systematic difference in abundance. When this occurs the geometric mean of each sample and group can be markedly different, and thus an inherent skew in the dataset can occur that leads to false positive and false negative feature calls. Asymmetry generally shows as a the centre of mass of the histogram for the x.all$diff.btw or x.all$effect being not centred around zero32. We recommend that all datasets be examined for asymmetry.
The preferred approach is to build a full scale model of the
asymmetry and to use that as an input to aldex.clr
.
Fundamentally, asymmetry will often arise because of a difference in
size of the underlying environment. For example, one cell type may grow
faster than another, or may have a different total number of mRNA
molecules than another, or one environment may have a different
microbial carrying capacity than another. This is the situation that is
best taken care of by a scale model33.
The example dataset below is the simulated synth2 dataset from from Wu et al(Wu et al. 2021). In this example there are 1000 features with 20 (2%) of those being asymmetrically sparse; i.e., 20 features are 0 in one group and non-zero in the other. A further 50 have an approximate 2-fold change (FC is normally distributed). The asymmetry has the effect of changing the geometric mean of the two groups. In addition, the data have an unrealistically low dispersion.
The three plots show the behaviour of unmodified and scale corrected
ALDEx2
.
The base output shows that some features are differentially abundant because of the inappropriate centre of the data. The approximate centre is the median difference of the points and is noted.
Adding in uncertainty about the scale of the data has a minimal effect on the median difference, but does remove all false positives. However, now there is an asymmetry in the false negatives, in that those in the ‘S’ category are more likely to be called negatives than those in the ‘N’ category.
Finally, adding in uncertainty about scale, and information on the fraction of the features that are asymmetric results in the desired output.
We recommend the scale approach whenever possible, especially if
there is some underlying biology that could drive the asymmetry. See the
scale_sim
vignette for more information on these new
capabilities.
The default ALDEx2
scale correction includes only
uncertainty and not asymmetry.
set.seed(11)
data(synth2)
# basic ALDEx2
blocks <- c(rep("N", 10),rep("S", 10))
x <- aldex.clr(synth2, blocks, gamma=NULL)
x.e <- aldex.effect(x)
# Add in asymmetry and scale variance
# gamma can be smaller for the same output dispersion
# in the full scale model built here
# make a base scale for group 1 and 2
mu.vec = c(log2(rep(1,10)), log2(rep(1.02,10)))
scale_samples <- aldex.makeScaleMatrix(gamma=0.25, mu=mu.vec,
conditions=blocks)
xs <- aldex.clr(synth2, blocks, gamma=scale_samples)
xs.e <- aldex.effect(xs)
# Add in only scale variance with sd=1
xg <- aldex.clr(synth2, blocks, gamma=0.5)
xg.e <- aldex.effect(xg)
par(mfrow=c(1,3))
aldex.plot(x.e, test='effect', main='base', xlim=c(0.1,3))
text(1, 4, labels=paste('median diff =',round(median(x.e$diff.btw),3)))
aldex.plot(xg.e, test='effect', main='ratio=1, sd=0.5', xlim=c(0.1,3))
text(1.6, 4, labels=paste('median diff =',round(median(xg.e$diff.btw),3)))
aldex.plot(xs.e, test='effect', main='ratio=1.02:1, sd=0.5', xlim=c(0.1,3))
text(1.6, 4, labels=paste('median diff =',round(median(xs.e$diff.btw),3)))
The other approach that can be taken by ALDEx2 is
to identify those features that are relatively invariant in all features
in the entire dataset even though many features could be asymmetric
between the groups. Fundamentally, the log-ratio approach requires that
the denominator across all samples be comparable. The output of
aldex.clr
contains the offset of the features used for the
denominator in the @denom
slot.
These methods should be considered deprecated and may be removed in future releases since their outcomes can be completely reproduced using scale models which are much more easily explained and which provide insights into the underlying environment from which the data were collected.
denom
The aldex.clr
function incorporates multiple approaches
to select the denominator that can best deal with asymmetric
datasets:
IMPORTANT: no rows should contain all 0 values as they will be
removed by the aldex.clr
function
all: The default is to calculate the geometric mean of all features using the centred log-ratio of Aitchison34. This is the default method for the compositional data analysis approach.
iqlr: The iqlr method identifies those features that exhibit reproducible variance in the entire dataset. This is called the inter-quartile log-ratio or iqlr approach. For this, a uniform prior of 0.5 is applied to the dataset, the clr transformation is applied, and the variance of each feature is calculated. Those features that have variance values that fall between the first and third quartiles of variance for all features in all groups in the dataset are retained. When aldex.clr is called, the geometric mean abundance of only the retained features is calculated and used as the denominator for log-ratio calculations. Modelling shows that this approach is effective in dealing with datasets with up to 25% of the features being asymmetric. The approach has the advantage it has little or no effect on symmetric datasets and so is a safe approach if the user is unsure if the data is mildly asymmetric.
lvha: This method identifies those features that in the bottom quartile for variance in each group and the top quartile for relative abundance for each sample and across the entire dataset. This method is appropriate when the groups are very asymmetric, but there are some features that are expected to be relatively constant. The basic idea here is to identify those features that are relatively constant across all samples, similar to the features that would be chosen as internal standards in qPCR. Experience suggests that meta-genomic and meta-transcriptomic datasets can benefit from this method of choosing the denominator. This method does not work with the selex dataset, since no features fit the criteria.
zero: This approach identifies and uses only those
features that are non-zero in each group. In this approach the per-group
non-zero features are used when aldex.clr
calculates the
geometric mean in the clr transformation. This method is appropriate
when the groups are very asymmetric, but the experimentalist must ask
whether the comparison is valid in these extreme cases.
user: The last new approach is to let the user define the set of `invariant’ features. In the case of meta-rna-seq, it could be argued that the levels of housekeeping genes should be standard for all samples. In this case the user could define the row indices that correspond to the particular set of housekeeping genes to use as the standard.
iterate: This method identifies those features that are not statistically significantly different between groups using the statistical test of choice. It may be combined with other methods.
On occasion ALDEx2 may overflow the memory available. This is seen when very large datasets containing thousands of samples and tens of thousands of features are used.
The workaround is simple. Run the 'aldex
function with
MC.samples=2
multiple times, and then take the average of
the values that you want to use for inference. We recommend taking the
mean of at least 16 such replicates. Testing shows that this
recapitulates the expected result, and is not (too far off) what ALDEx2
is doing under the hood.
I am grateful that ALDEx2 has taken on a life of its own.
scale_sim
vignette. More scale
corrections will be forthcoming.. No further changes are expected for that version since it can be
replicated completely within ALDEx2 by
using only the aldex.clr
and aldex.effect
commands.
Versions 2.0 to 2.05 were development versions that enabled p-value calculations. Version 2.06 of ALDEx2 was the version used for the analysis in37. This version enabled large sample comparisons by calculating effect size from a random sample of the data rather than from an exhaustive comparison.
Version 2.07 of ALDEx2 was the initial the modular version that exposed the intermediate calculations so that investigators could write functions to analyze different experimental designs. As an example, this version contains an example one-way ANOVA module. This is identical to the version submitted to Bioconductor as 0.99.1.
Future releases of ALDEx2 now use the Bioconductor versioning numbering.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ALDEx2_1.39.0 latticeExtra_0.6-30 lattice_0.22-6
## [4] zCompositions_1.5.0-4 truncnorm_1.0-9 NADA_1.6-1.1
## [7] survival_3.7-0 MASS_7.3-61 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.35.5 xfun_0.48
## [3] bslib_0.8.0 Biobase_2.65.1
## [5] quadprog_1.5-8 tools_4.4.1
## [7] stats4_4.4.1 parallel_4.4.1
## [9] highr_0.11 Matrix_1.7-1
## [11] RColorBrewer_1.1-3 S4Vectors_0.43.2
## [13] RcppParallel_5.1.9 lifecycle_1.0.4
## [15] GenomeInfoDbData_1.2.13 compiler_4.4.1
## [17] deldir_2.0-4 codetools_0.2-20
## [19] GenomeInfoDb_1.41.2 htmltools_0.5.8.1
## [21] sys_3.4.3 buildtools_1.0.0
## [23] sass_0.4.9 yaml_2.3.10
## [25] crayon_1.5.3 jquerylib_0.1.4
## [27] BiocParallel_1.39.0 cachem_1.1.0
## [29] DelayedArray_0.31.14 abind_1.4-8
## [31] digest_0.6.37 maketools_1.3.1
## [33] splines_4.4.1 fastmap_1.2.0
## [35] grid_4.4.1 cli_3.6.3
## [37] SparseArray_1.5.45 S4Arrays_1.5.11
## [39] Rfast_2.1.0 UCSC.utils_1.1.0
## [41] RcppZiggurat_0.1.6 rmarkdown_2.28
## [43] XVector_0.45.0 httr_1.4.7
## [45] matrixStats_1.4.1 jpeg_0.1-10
## [47] multtest_2.61.0 interp_1.1-6
## [49] png_0.1-8 evaluate_1.0.1
## [51] knitr_1.48 GenomicRanges_1.57.2
## [53] IRanges_2.39.2 rlang_1.1.4
## [55] Rcpp_1.0.13 BiocManager_1.30.25
## [57] directlabels_2024.1.21 BiocGenerics_0.51.3
## [59] jsonlite_1.8.9 R6_2.5.1
## [61] MatrixGenerics_1.17.1 zlibbioc_1.51.2
all high throughput sequencing data are compositional (Gregory B. Gloor et al. 2017; Gregory B. Gloor 2023) because of constraints imposed by the instruments↩︎
not accounting for scale leads to both false positive and false negative inference and is also a principled method of accounting for asymmetry in the underlying environment Nixon et al. (2023)↩︎
Macklaim et al. (2013)↩︎
T. P. Quinn, Crowley, and Richardson (2018)↩︎
Bian et al. (2017)↩︎
McMurrough et al. (2014);Wolfs et al. (2016)↩︎
Gregory B. Gloor (2023)↩︎
Andrew D. Fernandes et al. (2014)↩︎
Nixon et al. (2023)↩︎
and all other normalizations which are based on ratios — which is to say all of them↩︎
Marioni et al. (2008)↩︎
Andrew D. Fernandes et al. (2014);Gregory B. Gloor et al. (2016)↩︎
Nixon et al. (2023)↩︎
Van Zwet and Oosterhoff (1967)↩︎
Aitchison (1986)↩︎
Nixon et al. (2023)↩︎
McMurrough et al. (2014)↩︎
Altman and Bland (1983)↩︎
Gregory B. Gloor, Macklaim, and Fernandes (2016)↩︎
Cui and Churchill (2003)↩︎
in fact we recommend that the number of samples in the smallest group multiplied by the number of DMC be equal at least 1000 in order to generate a reasonably stable estimate of the posterior distribution↩︎
Van Zwet and Oosterhoff (1967)↩︎
Macklaim et al. (2013);McMurrough et al. (2014);Bian et al. (2017)↩︎
Fernandes, Gloor unpublished↩︎
Nixon et al. (2023)↩︎
McGovern, Nixon, and Silverman (2023)↩︎
Wu et al. (2021)↩︎
Andrew D. Fernandes et al. (2013);Andrew D. Fernandes et al. (2014);Gregory B. Gloor et al. (2016)↩︎
Hawinkel et al. (2018);Thorsen et al. (2016)↩︎
Soneson and Delorenzi (2013)↩︎
Kim (2015)↩︎
Wu et al. (2021)↩︎
Nixon et al. (2023);McGovern, Nixon, and Silverman (2023)↩︎
Aitchison:1986↩︎
T. Quinn et al. (2017)↩︎
Macklaim et al. (2013);Andrew D. Fernandes et al. (2013)↩︎
Andrew D. Fernandes et al. (2014)↩︎