Supplementary Information for “Detect issue heterogenity in gene expression data with BioQC” (Jitao David Zhang, Klas Hatje, Gregor Sturm, Clemens Broger, Martin Ebeling, Martine Burtin, Fabiola Terzi, Silvia Ines Pomposiello and Laura Badi)
In this vignette, we perform simulations with both model-generated and real-world data using BioQC. We show that BioQC is a sensitive method to detect tissue heterogeneity from high-throughput gene expression data. The source code used to produce this document can be found in the github repository BioQC-example.
BioQC is a R/Bioconductor package to detect tissue heterogeneity from high-throughput gene expression profiling data. It implements an efficient Wilcoxon-Mann-Whitney test, and offers tissue-specific gene signatures that are ready to use ‘out of the box’.
In this document, we perform two simulation studies with BioQC:
All source code that is needed to reproduce the results can be found
in the .Rmd
file generating this document.
library(BioQC)
library(hgu133plus2.db) ## to simulate an microarray expression dataset
library(lattice)
library(latticeExtra)
library(gridExtra)
library(gplots)
pdf.options(family="ArialMT", useDingbats=FALSE)
set.seed(1887)
## list human genes
humanGenes <- unique(na.omit(unlist(as.list(hgu133plus2SYMBOL))))
## read tissue-specific gene signatures
gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt",
package="BioQC")
gmt <- readGmt(gmtFile)
tissueInds <- sapply(gmt, function(x) match(x$genes, humanGenes))
We next asked the question how sensitive BioQC is to expression changes of tissue signature genes. Similar to the previous simulation, we create random expression matrices. While keeping all other genes i.i.d normally distributed following 𝒩(0, 1), we dedicatedly increase the expression of genes in one randomly selected tissue signature (ovary, with 43 genes) by different amplitudes: these genes’ expression levels are randomly drawn from different normal distributions with varying expectation and constant variance between 𝒩(0, 1) and 𝒩(3, 1). To test the robustness of the algorithm, 10 samples are generated for each mean expression difference value.
The above figure visualizes the distribution of enrichment scores and their ranks dependent on the mean expression difference between ovary signature genes and background genes. As soon as the expression of signature genes increases by a very moderate ampltiude 1σ, BioQC will identify the gene set as the highest-ranking signature. A even stronger difference in expression will lead to higher enrichment scores but no change in the rank.
The results suggest that BioQC is sensitive even to moderate changes in the average expression of a gene set.
The sensitivity benchmark above suffers from the limitation that the distributions of gene expression are not physiological. To overcome this, we designed and performed a benchmark by in silico mixing expression profiles with weighted linear combination, thereby mimicking tissue contamination.
Given the expression profile of a sample of tissue A (YA), and that of a sample of tissue B (YB), the weighted linear mixing produces a new profile Y = ωYA + (1 − ω)YB, where ω ∈ [0, 1]. In essence the profiles of two tissue types are linearly mixed in different proportions, which simulates varying severities of contaminations. We asked whether BioQC could detect such mixings, and if so, how sensitive is the method.
In order to avoid over-fitting of signatures derived from human expression data, we decided to use a normal tissue expression dataset from a non-human mammal species, because it has been shown that tissue-preferential expression patterns tend to be conserved between mammal species. We identified a dataset of Canis lupus familiaris (dog), which is publicly available in Gene Expression Omnibus (GDS4164).
In this study, the authors examined 39 samples from 10 pathologically normal tissues (liver, kidney, heart, lung, brain, lymph node, spleen, jejunum, pancreas, and skeletal muscle) of four dogs (with one pancreas sample missing). We downloaded the data, and performed minimal pre-processing: for multiple probesets that map to same genes, we kept the one with the highest average expression level and removed the rest. The resulting dataset contained expression of 16797 genes. BioQC was applied to the dataset to test whether there are major contamination issues. The results, including tissues reported by the authors, and the BioQC tissue signatures with the highest and second-highest scores, are reported in the following table:
Label | BioQC.best | BioQC.second | |
---|---|---|---|
GSM502573 | Brain | Spinal_cord | Nodose_nucleus |
GSM502574 | Brain | Brain_Cortex_prefrontal | Brain_Amygdala |
GSM502575 | Brain | Brain_Cortex_prefrontal | Brain_Amygdala |
GSM502576 | Brain | Brain_Cortex_prefrontal | Brain_Amygdala |
GSM502577 | Heart | Muscle_cardiac | Muscle_skeletal |
GSM502578 | Heart | Muscle_cardiac | Muscle_skeletal |
GSM502579 | Heart | Muscle_cardiac | Muscle_skeletal |
GSM502580 | Heart | Muscle_cardiac | Muscle_skeletal |
GSM502581 | Jejunum | Intestine_small | Intestine_Colon |
GSM502582 | Jejunum | Intestine_small | Intestine_Colon |
GSM502583 | Jejunum | Intestine_small | Intestine_Colon |
GSM502584 | Jejunum | Intestine_small | Intestine_Colon |
GSM502585 | Kidney | Kidney | Kidney_Renal_Cortex |
GSM502586 | Kidney | Kidney | Kidney_Renal_Cortex |
GSM502587 | Kidney | Kidney | Kidney_Renal_Cortex |
GSM502588 | Kidney | Kidney | Kidney_Renal_Cortex |
GSM502589 | Liver | Liver | Liver |
GSM502590 | Liver | Liver | Liver |
GSM502591 | Liver | Liver | Liver |
GSM502592 | Liver | Liver | Liver |
GSM502593 | Lung | Lung | Monocytes |
GSM502594 | Lung | Monocytes | Lung |
GSM502595 | Lung | Lung | Monocytes |
GSM502596 | Lung | Monocytes | Lung |
GSM502597 | LymphNode | Lymphocyte_B_FOLL | Lymphocytes_T_H |
GSM502598 | LymphNode | Lymphocyte_B_FOLL | Lymphocytes_T_H |
GSM502599 | LymphNode | Lymphocyte_B_FOLL | Lymphocytes_T_H |
GSM502600 | LymphNode | Lymphocyte_B_FOLL | Lymphocytes_T_H |
GSM502601 | Pancreas | Pancreas_Islets | Pancreas |
GSM502602 | Pancreas | Pancreas_Islets | Pancreas |
GSM502603 | Pancreas | Pancreas_Islets | Pancreas |
GSM502604 | SkeletalMuscle | Muscle_skeletal | Muscle_cardiac |
GSM502605 | SkeletalMuscle | Muscle_skeletal | Muscle_cardiac |
GSM502606 | SkeletalMuscle | Muscle_skeletal | Muscle_cardiac |
GSM502607 | SkeletalMuscle | Muscle_skeletal | Muscle_cardiac |
GSM502608 | Spleen | Monocytes | Lymphocyte_B_FOLL |
GSM502609 | Spleen | Monocytes | Lymphocyte_B_FOLL |
GSM502610 | Spleen | Monocytes | Erythroid_cells |
GSM502611 | Spleen | Monocytes | Myeloblast |
By comparing the tissue labels provided by the authors and the predictions of BioQC, we notice that in most cases the two match well (despite of ontological differences). In three cases (sample ID GSM502573, GSM502594, and GSM502596) though, there seem to be intriguing differences, which might be explained by different sampling procedures or immune cell infiltration. We will however in this vignette not further explore them. These three samples are removed from the simulation procedures.
As an example, we take average expression of heart and jejunum samples, and mix them by different compositions. This allows us comparing enrichment scores and their ranks when the expression profiles of heart and jejunum are mixed in silico:
We observe that with as little as 5% contamination of heart tissue in jejunum samples (rightmost in the right panel), the rank of heart signature jumps from 34 to 9; 10% and 20% contamination will further enhance the rank to 4 and 3 respectively. If we start from the other end, namely assuming jejunum contamination in heart samples, the BioQC algorithms ranks jejunum the 7th only when there are more than 25% contamination. If we set enrichment score equal or over 3 as the threshold of calling a suspected contamination event (p < 0.001 in the one-sided Wilcoxon-Mann-Whitney test), it takes about 10% heart in jejunum tissue or about 30% jejunum tissue in heart to make a call. It means the sensitivity of contamination detection is not symmetric between tissues: contamination by tissues with distinct expression patterns (such as heart) are easier to be detected than contamination by tissues with less distinct expression patterns (such as small intestine).
While it is difficult to quantify the absolute sensitivity of contamination detection, it is apparent that if the enrichment score of a unforeseen tissue is very high (or ranked high), one may suspect potential contamination. Also, if there are replicates of samples from the same tissue, a higher value in one sample compared with the other samples suggests a contamination or infiltration incident.
Following the heart-jejunum example, we performed all 45 pairwise mixing experiments, producing weighted linear combinations of gene expression profiles of each pair of tissues (excluding self-mixing). The results are summaried in a heatmap:
The heatmap visualization summarizes the detection limit of contamination of each pair of tissues. Take the cell in row 1 column 2 from top left: its value (0.15) means that if there are 15% or more contamination by heart in the brain sample, BioQC will be able to detect it (with the threshold enrichment score ≥ 3 or the rank ≤ 10), because the enrichment score is equal to or larger than 3, or the heart tissue signature ranks in the top 10 of all tissue signatures.
Take another cell in row 2 column 1 from top left: its value (0.5) means that if there are 50% or more contanmination by brain in a heart sample, BioQC will be able to detect it. Here we observe the asymmetry again that we observed before with the heart/jejenum example: while it is realtively easy to identify heart contamination of a brain sample, it is more difficult to identify brain contamination of a heart sample in this dataset.
The average detection limits of tissues as contamination sources are listed in the following table. The values are derived from median values of each column in the heatmap, except for diagonal and missing elements.
Tissue | MedianDetectionLimit |
---|---|
Brain | 52.22% |
Heart | 10.62% |
Jejunum | 19.44% |
Kidney | 23.75% |
Liver | 11.87% |
Lung | 31.67% |
LymphNode | 30.63% |
Pancreas | 17.22% |
SkeletalMuscle | 15.00% |
Spleen | 13.57% |
Interestingly, brain samples are a special case: if they contaminate other tissues, it is more difficult to identify (but not other way around). It can be partially explained by the experiment design: Briggs et al. profiled the whole cerebrum, whereas in BioQC there are 22 distinct gene sets assigned to distinct brain regions. Though the prefrontal cortex signature scored highest in the canine brain samples, its score is relative low (7.45), and the genes in the signature are not too far away from the background genes:
Therefore only a strong contamination by brain in this dataset will be detected by the given threshold. We expect that if prefrontal cortex instead of cerebrum sample was profiled, the mixing profile of brain will be similar to other organs. This needs to be tested in other datasets.
Apart from that, most in silico contamination events are detectable in this dataset, with median detection limit around 0.2. This suggests that BioQC is sensitive towards tissue heterogeneity in physiological settings.
Benchmark studies with similated and real-world data demonstrate that BioQC is an efficient and sensitive method to detect tissue heterogeneity from high-throughput gene expression data.
In the context of the dog transcriptome dataset, we can compare the results of principal component analysis (PCA) with that of BioQC:
PCA sugggests that samples from each tissue tend to cluster together, in line with the BioQC results. In addition, PCA reveals that tissues with cells of similar origins cluster together, such as skeletal muscle and heart. As expected, one brain sample and two lung samples seem to be different from other samples of the same cluster, which are consistent with the BioQC findings; however, unlike BioQC, PCA does not provide information on what are potential contamination/infiltration casues.
Therefore, we believe BioQC complements existing unsupervised methods to inspect quality of gene expression data.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] hgu133plus2.db_3.13.0 rbenchmark_1.0.0 gplots_3.2.0
## [4] gridExtra_2.3 latticeExtra_0.6-30 lattice_0.22-6
## [7] org.Hs.eg.db_3.20.0 AnnotationDbi_1.69.0 IRanges_2.41.1
## [10] S4Vectors_0.45.2 testthat_3.2.1.1 limma_3.63.2
## [13] RColorBrewer_1.1-3 BioQC_1.35.0 Biobase_2.67.0
## [16] BiocGenerics_0.53.3 generics_0.1.3 knitr_1.49
## [19] rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 gtable_0.3.6 xfun_0.49
## [4] bslib_0.8.0 caTools_1.18.3 bitops_1.0-9
## [7] vctrs_0.6.5 tools_4.4.2 RSQLite_2.3.8
## [10] blob_1.2.4 pkgconfig_2.0.3 KernSmooth_2.23-24
## [13] desc_1.4.3 lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [16] compiler_4.4.2 deldir_2.0-4 Biostrings_2.75.1
## [19] brio_1.1.5 statmod_1.5.0 GenomeInfoDb_1.43.2
## [22] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
## [25] sass_0.4.9 yaml_2.3.10 crayon_1.5.3
## [28] jquerylib_0.1.4 cachem_1.1.0 gtools_3.9.5
## [31] locfit_1.5-9.10 digest_0.6.37 maketools_1.3.1
## [34] rprojroot_2.0.4 fastmap_1.2.0 grid_4.4.2
## [37] cli_3.6.3 magrittr_2.0.3 edgeR_4.5.0
## [40] UCSC.utils_1.3.0 bit64_4.5.2 XVector_0.47.0
## [43] httr_1.4.7 jpeg_0.1-10 bit_4.5.0
## [46] interp_1.1-6 png_0.1-8 memoise_2.0.1
## [49] evaluate_1.0.1 rlang_1.1.4 Rcpp_1.0.13-1
## [52] glue_1.8.0 DBI_1.2.3 pkgload_1.4.0
## [55] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.52.0