Let’s look at the airway dataset as an example of a typical small-scale RNA-Seq experiment. In this experiment, four Airway Smooth Muscle (ASM) cell lines are treated with the asthma medication dexamethasone.
The function weitrix_calibrate_all
will be used to
assign precision weights to log Counts Per Million values.
library(weitrix)
library(ComplexHeatmap)
library(EnsDb.Hsapiens.v86)
library(edgeR)
library(limma)
library(reshape2)
library(tidyverse)
library(airway)
set.seed(1234)
# BiocParallel supports multiple backends.
# If the default hangs or errors, try others.
# The most reliable backed is to use serial processing
BiocParallel::register( BiocParallel::SerialParam() )
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Initial steps are the same as for a differential expression analysis.
counts <- assay(airway,"counts")
design <- model.matrix(~ dex + cell, data=colData(airway))
good <- filterByExpr(counts, design=design)
table(good)
## good
## FALSE TRUE
## 46817 16860
airway_dgelist <- DGEList(counts[good,]) %>% calcNormFactors()
airway_lcpm <- cpm(airway_dgelist, log=TRUE, prior.count=1)
log2 CPM values have been calculated, with an added “prior” count of (on average) 1 read, so that zeros aren’t sent to negative infinity.
airway_weitrix <- as_weitrix(airway_lcpm)
# Include row and column information
colData(airway_weitrix) <- colData(airway)
rowData(airway_weitrix) <-
mcols(genes(EnsDb.Hsapiens.v86))[rownames(airway_weitrix),c("gene_name","gene_biotype")]
airway_weitrix
## class: SummarizedExperiment
## dim: 16860 8
## metadata(1): weitrix
## assays(2): x weights
## rownames(16860): ENSG00000000003 ENSG00000000419 ... ENSG00000273487
## ENSG00000273488
## rowData names(2): gene_name gene_biotype
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
To calibrate, we need predicted expression levels so we can calculate
residuals. The function weitrix_components
can provide
linear model fits to each gene. (We will see a more advanced use of this
function later.)
Currently the weitrix has weights uniformly equal to 1. Examining residuals, we see a variance trend relative to the linear model prediction.
Each dot in the plot below is a residual weighted by sqrt(weight). The x-axis is the linear model prediction. The y-axis is the weighted residual (all weights are currently 1). The red lines show the mean and the mean +/- one standard deviation.
We will use the function weitrix_calibrate_all
to set
the weights by fitting a gamma GLM with log link function to the
weighted squared residuals. 1 over the predictions from this GLM are
used as weights. Here we fit a natural spline based on the linear model
predictions from fit
, which are referred to in the formula
as mu
. well_knotted_spline
is a wrapper around
splines::ns
for natural splines with improved choice of
knot locations.
airway_cal <- weitrix_calibrate_all(
airway_weitrix,
design = fit,
trend_formula = ~well_knotted_spline(mu,5))
## mu range -5.552401 14.303375 knots -0.8637096 1.4763752 3.6914433 5.6106759 8.0243479
## $weitrix
## $weitrix$x_name
## [1] "x"
##
## $weitrix$weights_name
## [1] "weights"
##
## $weitrix$all_coef
## (Intercept) well_knotted_spline(mu, 5)1
## -0.4307917 -3.0128598
## well_knotted_spline(mu, 5)2 well_knotted_spline(mu, 5)3
## -4.0044737 -4.6128293
## well_knotted_spline(mu, 5)4 well_knotted_spline(mu, 5)5
## -3.3897229 -5.2701723
## well_knotted_spline(mu, 5)6
## -3.1086780
The trend lines (red) for the calibrated weitrix are now uniformly close to 1, 0, -1 (guide lines shown in blue). Weights can now be treated as inverse residual variance.
Another way to check this is to plot the unweighted residuals against 1/sqrt(weight), i.e. the residual standard deviation.
We can also check each sample individually.
If there was an odd sample, we could use a more complex trend formula
such as ~col*well_knotted_spline(mu,4)
:
While the precise details differ, what we have done is very similar
to the voom
function in limma.
weitrix_calplot
can be used with voomed data as
well.
voom
achieves a similar result to
weitrix_calibrate_all
(but note the input to
voom
is a DGEList of counts, not a matrix that is already
log transformed). limma EList objects can be converted to weitrix
objects with as_weitrix
. Weitrix objects can be converted
to limma EList objects with weitrix_elist
.
RNA-Seq expression is well trodden ground. The main contribution of the weitrix package here is to aid exploration by discovering components of variation, providing not just column scores but row loadings and respecting precision weights.
weitrix_sd_confects
will find genes with excess
variation in a calibrated weitrix.
## $table
## confect effect row_mean typical_obs_err dispersion n_present df fdr_zero
## 1 2.620 3.121 6.631 0.2930 114.49 8 7 0
## 2 2.608 3.020 4.731 0.2495 147.49 8 7 0
## 3 2.368 2.785 3.850 0.2573 118.16 8 7 0
## 4 2.076 2.484 4.006 0.2557 95.41 8 7 0
## 5 2.009 2.271 8.249 0.1658 188.60 8 7 0
## 6 1.993 2.477 2.849 0.3102 64.79 8 7 0
## 7 1.905 2.208 4.396 0.1961 127.83 8 7 0
## 8 1.893 2.334 3.054 0.2875 66.93 8 7 0
## 9 1.892 2.145 6.065 0.1657 168.64 8 7 0
## 10 1.856 2.114 5.431 0.1702 155.23 8 7 0
## name gene_name gene_biotype
## 1 ENSG00000229807 XIST lincRNA
## 2 ENSG00000262902 MTCO1P40 processed_pseudogene
## 3 ENSG00000171819 ANGPTL7 protein_coding
## 4 ENSG00000243137 PSG4 protein_coding
## 5 ENSG00000123243 ITIH5 protein_coding
## 6 ENSG00000198732 SMOC1 protein_coding
## 7 ENSG00000162817 C1orf115 protein_coding
## 8 ENSG00000235750 KIAA0040 protein_coding
## 9 ENSG00000096060 FKBP5 protein_coding
## 10 ENSG00000204941 PSG5 protein_coding
## ...
## 6216 of 16860 non-zero excess standard deviation at FDR 0.05
“effect” is root-mean-square variation in residuals relative to a
fitted model in excess of what is expected from the calibrated
weights. Here the model only has an intercept term, so the residuals
represent variation from the weighted mean. “confect” scores are lower
confidence bounds on the effect, adjusted for multiple testing using the
topconfects
method.
The default method assumes errors are normally distributed. If you have a large number of columns, this assumption can be relaxed. This is more appropriate for single cell data, which will typically have a large number of cells, and also not have normally distributed errors due to the sparse nature of the data. In single-cell experiments, this should be a good way to find marker genes.
## $table
## confect effect row_mean typical_obs_err dispersion n_present df fdr_zero
## 1 2.4670 3.0204 4.73096 0.2495 147.49 8 7 2.569e-05
## 2 1.3842 2.3499 0.04261 0.5470 19.45 8 7 2.799e-03
## 3 0.7603 2.3745 1.94467 0.3821 39.62 8 7 1.468e-02
## 4 0.7603 2.1681 2.92166 0.2709 65.05 8 7 1.468e-02
## 5 0.7603 2.3265 2.71176 0.3267 51.70 8 7 1.468e-02
## 6 0.7141 2.1156 -0.41319 0.5607 15.24 8 7 1.468e-02
## 7 0.6611 2.3662 0.08127 0.5422 20.05 8 7 1.632e-02
## 8 0.6229 2.0476 2.12185 0.3153 43.17 8 7 1.632e-02
## 9 0.6189 2.2522 1.21883 0.4135 30.66 8 7 1.800e-02
## 10 0.3847 0.7536 8.80031 0.1657 21.68 8 7 1.468e-02
## name gene_name gene_biotype
## 1 ENSG00000262902 MTCO1P40 processed_pseudogene
## 2 ENSG00000221818 EBF2 protein_coding
## 3 ENSG00000144191 CNGA3 protein_coding
## 4 ENSG00000247627 MTND4P12 processed_pseudogene
## 5 ENSG00000178919 FOXE1 protein_coding
## 6 ENSG00000261239 ANKRD26P1 transcribed_unprocessed_pseudogene
## 7 ENSG00000226278 PSPHP1 unprocessed_pseudogene
## 8 ENSG00000248810 RP11-362F19.1 lincRNA
## 9 ENSG00000223722 RP11-467L13.5 processed_pseudogene
## 10 ENSG00000148175 STOM protein_coding
## ...
## 39 of 16860 non-zero excess standard deviation at FDR 0.05
Top genes can then be examined to find a reason for their variation. For example, we see that XIST is highly expressed in a particular cell type.
The code below will find various numbers of components, from 1 to 6. In each case, the components discovered have varimax rotation applied to their gene loadings to aid interpretability. The result is a list of Components objects.
## [[1]]
## Components are: (Intercept), C1
## $row : 16860 x 2 matrix
## $col : 8 x 2 matrix
## $R2 : 0.4046082
##
## [[2]]
## Components are: (Intercept), C1, C2
## $row : 16860 x 3 matrix
## $col : 8 x 3 matrix
## $R2 : 0.6364208
##
## [[3]]
## Components are: (Intercept), C1, C2, C3
## $row : 16860 x 4 matrix
## $col : 8 x 4 matrix
## $R2 : 0.8055695
##
## [[4]]
## Components are: (Intercept), C1, C2, C3, C4
## $row : 16860 x 5 matrix
## $col : 8 x 5 matrix
## $R2 : 0.9096617
##
## [[5]]
## Components are: (Intercept), C1, C2, C3, C4, C5
## $row : 16860 x 6 matrix
## $col : 8 x 6 matrix
## $R2 : 0.9500206
##
## [[6]]
## Components are: (Intercept), C1, C2, C3, C4, C5, C6
## $row : 16860 x 7 matrix
## $col : 8 x 7 matrix
## $R2 : 0.9778973
We can compare the proportion of variation explained to what would be explained in a completely random weitrix. Random normally distributed values are generated with variances equal to one over the weights.
Up to 4 components may be justified.
comp <- comp_seq[[4]]
comp$col[,-1] %>% melt(varnames=c("Run","component")) %>%
left_join(as.data.frame(colData(airway)), by="Run") %>%
ggplot(aes(y=cell, x=value, color=dex)) +
geom_vline(xintercept=0) +
geom_point(alpha=0.5, size=3) +
facet_grid(~ component) +
labs(title="Sample scores for each component", x="Sample score", y="Cell line", color="Treatment")
If varimax rotation isn’t used, weitrix_components
and
weitrix_components_seq
will produce a Principal Components
Analysis, with components ordered from most to least variance
explained.
Without varimax rotation the treatment effect is still mostly in the first component, but has also leaked a small amount into the other components.
comp_nonvarimax <- weitrix_components(airway_cal, p=4, use_varimax=FALSE)
comp_nonvarimax$col[,-1] %>% melt(varnames=c("Run","component")) %>%
left_join(as.data.frame(colData(airway)), by="Run") %>%
ggplot(aes(y=cell, x=value, color=dex)) +
geom_vline(xintercept=0) +
geom_point(alpha=0.5, size=3) +
facet_grid(~ component) +
labs(title="Sample scores for each component, no varimax rotation", x="Sample score", y="Cell line", color="Treatment")
col
can potentially be used as a design matrixIf you’re not sure of the experimental design, for example the exact timing of a time series or how evenly a drug treatment was applied, the extracted component might actually be more accurate.
Note that this ignores uncertainty about the col
matrix
itself.
This may be useful for hypothesis generation – finding some potentially interesting genes, while discounting noisy or lowly expressed genes – but don’t use it as proof of significance.
First by the topconfects method. This will find the largest confident effect sizes, while still correcting for multiple testing.
## $table
## confect effect se df fdr_zero row_mean typical_obs_err
## 1 6.606 12.075 0.5919 9.195 2.653e-06 2.311 0.2012
## 2 6.463 11.783 0.6258 9.195 3.736e-06 4.755 0.2381
## 3 5.997 7.957 0.2423 9.195 3.073e-07 4.746 0.1522
## 4 5.956 9.320 0.4307 9.195 1.907e-06 3.850 0.2540
## 5 5.746 7.873 0.2799 9.195 4.763e-07 5.405 0.1935
## 6 5.510 6.586 0.1466 9.195 7.303e-08 7.758 0.1171
## 7 5.510 8.310 0.3843 9.195 1.907e-06 3.399 0.2057
## 8 5.236 8.871 0.5074 9.195 5.201e-06 4.021 0.2664
## 9 5.069 9.497 0.6276 9.195 1.135e-05 2.014 0.2745
## 10 -5.062 -6.453 0.1999 9.195 3.073e-07 4.648 0.1340
## name gene_name gene_biotype
## 1 ENSG00000179593 ALOX15B protein_coding
## 2 ENSG00000109906 ZBTB16 protein_coding
## 3 ENSG00000163884 KLF15 protein_coding
## 4 ENSG00000171819 ANGPTL7 protein_coding
## 5 ENSG00000152583 SPARCL1 protein_coding
## 6 ENSG00000101347 SAMHD1 protein_coding
## 7 ENSG00000168309 FAM107A protein_coding
## 8 ENSG00000127954 STEAP4 protein_coding
## 9 ENSG00000250978 RP11-357D18.1 processed_transcript
## 10 ENSG00000162692 VCAM1 protein_coding
## ...
## 5540 of 16860 non-zero contrast at FDR 0.05
## Prior df 6.2
If you prefer limma and p-values:
airway_elist <- weitrix_elist(airway_cal)
fit <-
lmFit(airway_elist, comp$col) %>%
eBayes()
fit$df.prior
## [1] 6.19524
## [1] 0.5946888
## gene_name gene_biotype logFC AveExpr t
## ENSG00000101347 SAMHD1 protein_coding 6.585680 8.135498 44.90834
## ENSG00000189221 MAOA protein_coding 5.730034 5.951528 34.04168
## ENSG00000178695 KCTD12 protein_coding -4.417195 6.452192 -31.72330
## ENSG00000139132 FGD4 protein_coding 3.809042 5.416374 31.54337
## ENSG00000179094 PER1 protein_coding 5.479478 4.423081 30.43529
## ENSG00000120129 DUSP1 protein_coding 5.120892 6.644985 29.38798
## ENSG00000162692 VCAM1 protein_coding -6.453409 3.576221 -32.29109
## ENSG00000124766 SOX4 protein_coding -4.273940 5.427907 -28.70344
## ENSG00000163884 KLF15 protein_coding 7.957297 3.249262 32.84342
## ENSG00000165995 CACNB2 protein_coding 5.717502 3.686989 29.63384
## P.Value adj.P.Val B
## ENSG00000101347 4.331667e-12 7.303191e-08 17.48757
## ENSG00000189221 5.456200e-11 3.073475e-07 15.61359
## ENSG00000178695 1.038475e-10 3.073475e-07 15.14447
## ENSG00000139132 1.093763e-10 3.073475e-07 15.08271
## ENSG00000179094 1.515295e-10 3.649695e-07 14.56555
## ENSG00000120129 2.084677e-10 3.905294e-07 14.51588
## ENSG00000162692 8.833396e-11 3.073475e-07 14.50721
## ENSG00000124766 2.583657e-10 4.356046e-07 14.30694
## ENSG00000163884 7.567087e-11 3.073475e-07 14.29881
## ENSG00000165995 1.932349e-10 3.905294e-07 14.13861