The spqn package contains an implementation of spatial quantile normalization (SpQN) as described in (Wang, Hicks, and Hansen 2020). In this work we describe how correlation matrices estimated from gene expression data exhibits a mean-correlation relationship, in which highly expressed genes appear more correlated than lowly expressed genes. This relationship is not observed in protein-protein interaction networks, which suggests it is technical rather than biological. To correct for this bias in estimation, we developed SpQN, a method for normalizing the correlation matrix. In our work, we show how correcting for this relationship increases the correlations involving transcription factors, an important class of gene regulators.
We believe spatial quantile normalization has applications outside of co-expression analysis including beyond correlation matrices. However, the implementation in this package is currently focused on correlation matrices.
The package contains various plotting functions for assessing the mean-correlation relationship, mirroring plots in (Wang, Hicks, and Hansen 2020).
Let’s load the package and some example data.
This dataset contains 4,000 genes from the GTEx tissue “Adipose Subcutaneous” which we use as exemplar tissue in our preprint (Wang, Hicks, and Hansen 2020). The data is from GTEx v6p. The genes have been restricted to a random sample of 4,000 genes from genes which are
The data is counts (from the GTEx website) of bulk RNA-seq data which
has been transformed to the expression matrix in the log2(RPKM) scale. Each gene has
been scaled to have mean 0 and variance 1, and 4 principal components
have been removed from the data matrix. For SpQN we need access to the
gene expression level, which we take as the mean or median expression
level across samples. So this quantity needs to be computed (and stored)
prior to removing principal components. For this example data, we have
stored the average expression value on the log-RPKM scale in
rowData(gtex.4k)$ave_logrpkm
.
To compare the distribution of correlation across different expression levels, we get the correlation matrix and average gene expression level (log2(RPKM)) for each gene.
Note: code in this section is not being evaluated.
We assume you have counts (or something similar) stored in a
SummarizedExperiment
. In our case, we have downloaded the
GTEx v6p counts from the GTEx portal website (code is available in the
spqnData
package inside scripts
).
It makes sense to do some correction for library size. In this data we have simply divided by the number of reads in millions. There are multiple normalization strategies for estimating a library size correction factor. The literature shows that using such methods are essential when comparing expression across different tissues / cell types / conditions. However, here we are looking at correlation patterns within tissue.
We also need to remove genes which are not expressed. The threshold we have been using for GTEx is fairly liberal as it removes less than 50% of protein-coding codes. We apply this cutoff on the log2RPKM scale as it is necessary to consider gene length when applying a cutoff across genes. However, the exact way of doing this is probably not too important.
Finally, following (Parsana et al.
2019) we strongly recommend removing principal components from
the data matrix prior to analysis. A fast convenience function for this
is removePrincipalComponents()
from the WGCNA package. How
many principal components to remove is an open question. In (Freytag et al. 2015) the authors recommend
looking at control genes to assess this, under the hypothesis that
random genes should be uncorrelated and we have existing groups of genes
which are believe to be highly correlated in all tissues, such as the
PRC2 complex (Freytag et al. 2015) or the
ribosomal RNA genes (Boukas et al. 2019).
Another approach is to use the num.sv()
function from the
SVA package. In our experience, this yields a substantially higher
number of PCs than the graphical methods. Finally, as we shown (Wang, Hicks, and Hansen 2020), the QC plots we
present below might also help making this decision. At the end of the
day, it is still an open problem.
Example code is as follows, the starting point is a
SummarizedExperiment
called gtex
with the
relevant data. It has two relevant columns in
rowData(gtex)
. One column is gene_length
and
another is gene_type
.
# only keep coding genes and lincRNA in the counts matrix
gtex <- gtex[rowData(gtex)$gene_type %in%
c("lincRNA", "protein_coding"), ]
# normalize the counts matrix by log2rpkm
cSums <- colSums(assay(gtex))
logrpkm <- sweep(log2(assay(gtex) + 0.5), 2, FUN = "-",
STATS = log2(cSums / 10^6))
logrpkm <- logrpkm - log2(rowData(gtex)$gene_length / 1000)
# only keep those genes with median log2rpkm above 0
wh.expressed <- which(rowMedians(logrpkm) > 0)
gtex.0pcs <- gtex[wh.expressed,]
logrpkm.0pcs <- logrpkm[wh.expressed,]
ave_logrpkm <- rowMeans(logrpkm.0pcs)
logrpkm <- logrpkm - ave_logrpkm # mean centering
logrpkm <- logrpkm / matrixStats::rowSds(logrpkm) # variance scaling
assays(gtex.0pcs) <- SimpleList(logrpkm = logrpkm.0pcs)
rowData(gtex.0pcs)$ave_logrpkm <- ave_logrpkm
# remove PCs from the gene expression matrix after scaling each gene to have mean=0 and variance=1
gtex.4pcs <- gtex.0pcs
assay(gtex.4pcs) <- removePrincipalComponents(t(scale(t(logrpkm.0pcs))), n = 4)
We now recreate a number of diagnostic plots which illustrate the mean-correlation relationship. These plots are similar to that is being presented in (Wang, Hicks, and Hansen 2020).
To compare the distribution of correlation across different expression levels, we re-arrrange the correlation matrix by sortng the row and column by the expression level, and partitioned the correlation matrix into same-size bins, then used each bin to estimate the distribution of correlation for genes with that expression level. Here we use 10 by 10 bins. Each bin corresponds to two groups of genes that correspodnd to two expression levels.
We first looked into the bins on the diagnal, where the two groups of genes have the same expression level. We looked at the distribution of the correlations within each of these bins. We find that the correlations of highly expressed genes are more variant than those of lowly expressed genes. We call such relationship between gene expression level and the distribution of correlation as mean-correlation relationship.
Next, we examine the impact of this mean-correlation relationship on the downstream analysis. A number of studies predict the co-expressed genes by setting a threshold. We therefore examine how the mean-correlation relationship would bias such prediction. We set 0.1% highest correlations as predicted signals in each bin. We compare the distibution of signals for bins on the diagonal. We find that the the predicted signals are stronger for high expressed genes and weaker for low expressed genes. Because of such bias, the true signals among low expressed genes could be hidden and may not be detected.
## Warning in scale_y_discrete(limits = seq_len(10)): Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
Next we looked further into the bins across the entire correlation matrix. We quantify the variance inside a bin using inter-quantile range (IQR), which is the distance from the 1st and 3rd quantile. We use the width of box in the plot to denote IQR. We observe that the IQR would increase with the increase of expression level for either gene.
IQR_list <- get_IQR_condition_exp(cor_m, rowData(gtex.4k)$ave_logrpkm)
plot_IQR_condition_exp(IQR_list)
## Warning in scale_y_discrete(limits = c(seq_len(10)) + c(0:9) * min(IQR_cor_mat)/10, : Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
## Warning in scale_x_discrete(limits = c(seq_len(10)) + c(0:9) * min(IQR_cor_mat)/10, : Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
We next plot the marginal relationship between IQR of correlation and the expression level of the lower expression group for each bin, and we also see a positive relationship.
IQR_unlist <- unlist(lapply(1:10, function(ii) IQR_list$IQR_cor_mat[ii, ii:10]))
plot(rep(IQR_list$grp_mean, times = 1:10),
IQR_unlist,
xlab="min(average(log2RPKM))", ylab="IQR", cex.lab=1.5, cex.axis=1.2, col="blue")
We further examined the difference of the distributions using Q-Q plot. We used the bins with the second highest expression level on the diagonal as reference, and compared the distribution with other bins in the diagonal. We observe both scale and shape difference in the distributions.
To remove the mean-correlation relationship and avoid the bias in the co-expression analysis, we develop a normalization method for correlation matrix, named as spatial quantile normalization (SpQN). The main idea is to use bins to partition the correlation matrix with genes sorted by exp level, and then apply quantile normalization for each bin. To smooth the normalization, we used a larger bin to approximate the empirical distribution of the inner bin. And we applied 1-dimension quantile normalization on the inner bins using its approximated distribution. The target distribution we used here is the bin with the second highest expression level on the diagonal.
Using SpQN is relatively easy. In the function normalize_correlation, imput matrix cor_m is the correlation matrix. Imput vector ave_exp is the average expression level for the genes for the normalized expression matrix, corresponding to the row/column of cor_m. For other types of data, ave_exp can be the vector corresponding to the row/column of the correlation matrix, whose dependency with the distribution of correlations need to be removed. Parameter ngrp is an integer, denoting the number of bins in each row/column to be used to partition the correlation matrix in the process of SpQN. For example, setting ngrp=20 means to partition the correlation matrix into 20 by 20 bins. Parameter size_grp is an integer, denoting the size of the outter bins to be used to appriximate the distribution of the inner bins, in order to smooth the normalization. Note that the product of size_grp and ngrp must be equal or larger than than the row/column number of cor_m, and there is no smoothness in the normalization when they are equal. Parameter ref_grp is the location of the reference bin on the diagonal, whose distribution will be used as target distribution in the normalization.
To evaluate the performance of SpQN, we make the same plots as before, but using the normalized correlation matrix.
Compare distributions of correlations for bins on the diagonal of the correlation matrix.
Distribution of predicted signals and background correlations for bins on the diagonal of the correlation matrix.
## Warning in scale_y_discrete(limits = seq_len(10)): Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
Compare IQRs of bins across the entire correlation matrix.
IQR_spqn_list <- get_IQR_condition_exp(cor_m_spqn, rowData(gtex.4k)$ave_logrpkm)
plot_IQR_condition_exp(IQR_spqn_list)
## Warning in scale_y_discrete(limits = c(seq_len(10)) + c(0:9) * min(IQR_cor_mat)/10, : Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
## Warning in scale_x_discrete(limits = c(seq_len(10)) + c(0:9) * min(IQR_cor_mat)/10, : Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
Q-Q plot to compare the distribution for the bin with the 2nd highest expression level on the diagonal with other bins on the diagonal.
Marginal relationship between IQR and the expression level of the lower expressed group for each bin.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] spqnData_1.18.0 spqn_1.19.0
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.1 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [13] ggridges_0.5.6 ggplot2_3.5.1
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 SparseArray_1.7.2
## [4] lattice_0.22-6 digest_0.6.37 magrittr_2.0.3
## [7] evaluate_1.0.1 grid_4.4.2 fastmap_1.2.0
## [10] Matrix_1.7-1 jsonlite_1.8.9 BiocManager_1.30.25
## [13] httr_1.4.7 fansi_1.0.6 UCSC.utils_1.3.0
## [16] scales_1.3.0 jquerylib_0.1.4 abind_1.4-8
## [19] cli_3.6.3 crayon_1.5.3 rlang_1.1.4
## [22] XVector_0.47.0 munsell_0.5.1 DelayedArray_0.33.2
## [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [28] S4Arrays_1.7.1 tools_4.4.2 colorspace_2.1-1
## [31] GenomeInfoDbData_1.2.13 buildtools_1.0.0 vctrs_0.6.5
## [34] R6_2.5.1 lifecycle_1.0.4 zlibbioc_1.52.0
## [37] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.8.0
## [40] gtable_0.3.6 glue_1.8.0 xfun_0.49
## [43] tibble_3.2.1 sys_3.4.3 knitr_1.49
## [46] farver_2.1.2 htmltools_0.5.8.1 labeling_0.4.3
## [49] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2