Please cite GenomicSuperSignature as follows:
Oh, S., Geistlinger, L., Ramos, M. et al. GenomicSuperSignature facilitates interpretation of RNA-seq experiments through robust, efficient comparison to public databases. Nat Commun 13, 3695 (2022). https://doi.org/10.1038/s41467-022-31411-3
You can download GenomicSuperSignature from Google Cloud bucket using
GenomicSuperSignature::getModel
function. Currently
available models are built from top 20 PCs of 536 studies (containing
44,890 samples) containing 13,934 common genes from each of 536 study’s
top 90% varying genes based on their study-level standard deviation.
There are two versions of this RAVmodel annotated with different gene
sets for GSEA: MSigDB C2 (C2
) and three priors from PLIER
package (PLIERpriors
). In this vignette, we are showing the
C2
annotated model.
Note that the first interactive run of this code, you will be asked
to allow R to create a cache directory. The model file will be stored
there and subsequent calls to getModel
will read from the
cache.
RAVindex
is a matrix containing genes in rows and RAVs
in columns. colData
slot provides the information on each
RAVs, such as GSEA annotation and studies involved in each cluster.
metadata
slot stores model construction information.
trainingData
slot contains the information on individual
studies in training dataset, such as MeSH terms assigned to each
study.
RAVmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(8): cluster size ... version geneSets
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
version(RAVmodel)
#> [1] "1.1.1"
geneSets(RAVmodel)
#> [1] "MSigDB C2 version 7.1"
Replicable Axis of Variation (RAV) index
is the main component of GenomicSuperSignature. It serves as an index
connecting new datasets and the existing database. You can access it
through GenomicSuperSignature::RAVindex
(equivalent of
SummarizedExperiment::assay
). Rows are genes and columns
are RAVs.
Here, RAVmodel consists of 13,934 genes and 4,764 RAVs.
class(RAVindex(RAVmodel))
#> [1] "matrix" "array"
dim(RAVindex(RAVmodel))
#> [1] 13934 4764
RAVindex(RAVmodel)[1:4, 1:4]
#> RAV1 RAV2 RAV3 RAV4
#> CASKIN1 0.002449532 -2.845471e-03 0.001601715 -0.009366156
#> DDX3Y 0.006093438 1.870528e-03 -0.004346019 -0.006094526
#> MDGA1 0.009999975 2.610302e-03 -0.002831447 -0.006710757
#> EIF1AY 0.007923652 8.507021e-05 0.002322275 -0.007850784
Metadata slot of RAVmodel contains information related to the model building.
names(metadata(RAVmodel))
#> [1] "cluster" "size" "k" "n" "MeSH_freq"
#> [6] "updateNote" "version" "geneSets"
cluster
: cluster membership of each PCs from the
training datasetsize
: an integer vector with the length of clusters,
containing the number of PCs in each clusterk
: the number of all clusters in the given
RAVmodeln
: the number of top PCs kept from each study in the
training datasetgeneSets
: the name of gene sets used for GSEA
annotationMeSH_freq
: the frequency of MeSH terms associated with
the training dataset. MeSH terms like ‘Humans’ and ‘RNA-seq’ are top
ranked (which is very expected) because the training dataset of this
model is Human RNA sequencing data.updateNote
: a brief note on the given model’s
specificationversion
: the version of the given modelhead(metadata(RAVmodel)$cluster)
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4 DRP000987.PC5
#> 1 2 3 4 5
#> DRP000987.PC6
#> 6
head(metadata(RAVmodel)$size)
#> RAV1 RAV2 RAV3 RAV4 RAV5 RAV6
#> 6 21 4 7 3 3
metadata(RAVmodel)$k
#> [1] 4764
metadata(RAVmodel)$n
#> [1] 20
geneSets(RAVmodel)
#> [1] "MSigDB C2 version 7.1"
head(metadata(RAVmodel)$MeSH_freq)
#>
#> Humans RNA-Seq Transcriptome
#> 1163 591 576
#> RNA Whole Exome Sequencing Sequencing, RNA
#> 372 284 277
updateNote(RAVmodel)
#> [1] "GSEA with set.seed"
metadata(RAVmodel)$version
#> [1] "1.1.1"
You can find which studies are in each cluster using
studies
method. Output is a list with the length of
clusters, where each element is a character vector containing the name
of studies in each cluster.
length(studies(RAVmodel))
#> [1] 4764
studies(RAVmodel)[1:3]
#> $Cl4764_01
#> [1] "DRP000987" "SRP059172" "SRP059959" "SRP118741" "SRP144178" "SRP183186"
#>
#> $Cl4764_02
#> [1] "DRP000987" "ERP022839" "ERP104463" "ERP106996" "ERP109789" "ERP114435"
#> [7] "SRP059039" "SRP059172" "SRP065317" "SRP071547" "SRP073061" "SRP074274"
#> [13] "SRP075449" "SRP092158" "SRP096736" "SRP107337" "SRP111402" "SRP125008"
#> [19] "SRP133442" "SRP135684" "SRP155483"
#>
#> $Cl4764_03
#> [1] "DRP000987" "SRP001540" "SRP056840" "SRP125396"
You can check which PC from different studies are in RAVs using
PCinRAV
.
PCinRAV(RAVmodel, 2)
#> [1] "DRP000987.PC2" "ERP022839.PC2" "ERP104463.PC1" "ERP106996.PC2"
#> [5] "ERP109789.PC2" "ERP114435.PC1" "SRP059039.PC3" "SRP059172.PC2"
#> [9] "SRP065317.PC1" "SRP071547.PC3" "SRP073061.PC2" "SRP074274.PC5"
#> [13] "SRP075449.PC2" "SRP092158.PC2" "SRP096736.PC2" "SRP107337.PC3"
#> [17] "SRP111402.PC1" "SRP125008.PC2" "SRP133442.PC1" "SRP135684.PC4"
#> [21] "SRP155483.PC2"
Silhouette width ranges from -1 to 1 for each cluster. Typically, it
is interpreted as follows:
- Values close to 1 suggest that the observation is well matched to the
assigned cluster
- Values close to 0 suggest that the observation is borderline matched
between two clusters
- Values close to -1 suggest that the observations may be assigned to
the wrong cluster
For RAVmodel, the average silhouette width of each cluster is a quality control measure and suggested as a secondary reference to choose proper RAVs, following validation score.
Pre-processed GSEA results on each RAV are stored in RAVmodel and can
be accessed through gsea
function.
class(gsea(RAVmodel))
#> [1] "list"
class(gsea(RAVmodel)[[1]])
#> [1] "data.frame"
length(gsea(RAVmodel))
#> [1] 4764
gsea(RAVmodel)[1]
#> $RAV1
#> Description NES
#> REACTOME_MITOCHONDRIAL_TRANSLATION REACTOME_MITOCHONDRIAL_TRANSLATION 2.503359
#> REACTOME_TRANSLATION REACTOME_TRANSLATION 2.271857
#> CROONQUIST_STROMAL_STIMULATION_UP CROONQUIST_STROMAL_STIMULATION_UP -3.634893
#> pvalue qvalues
#> REACTOME_MITOCHONDRIAL_TRANSLATION 1.00000e-10 9.218794e-08
#> REACTOME_TRANSLATION 1.00000e-10 9.218794e-08
#> CROONQUIST_STROMAL_STIMULATION_UP 1.20798e-10 9.218794e-08
You can find MeSH terms associated with each study using
mesh
method. Output is a list with the length of studies
used for training. Each element of this output list is a data frame
containing the assigned MeSH terms and the detail of them. The last
column bagOfWords
is the frequency of the MeSH term in the
whole training dataset.
length(mesh(RAVmodel))
#> [1] 536
mesh(RAVmodel)[1]
#> $DRP000987
#> score identifier concept name explanation
#> 35822 1000 DRP000987 C0042333 Genetic Variation Forced N....
#> 58627 18575 DRP000987 C0017431 Genotype Forced L....
#> 149067 37162 DRP000987 C0021247 Indonesia RtM via:....
#> 216848 74145 DRP000987 C0030498 Parasites RtM via:....
#> 716101 22634 DRP000987 C4760607 RNA-Seq RtM via:....
#> 761253 55509 DRP000987 C0024530 Malaria RtM via:....
#> 832289 96999 DRP000987 C0024535 Malaria, Falciparum RtM via:....
#> 1116527 161422 DRP000987 C0032150 Plasmodium falciparum RtM via:....
#> 1230230 7197 DRP000987 C0020969 Innate Immune Response Entry Te....
#> 1468242 162422 DRP000987 C0003062 Animals Parasite....
#> 1573542 19938 DRP000987 C0035668 RNA RtM via: RNA
#> 1596889 162422 DRP000987 C0086418 Humans Forced L....
#> 1627781 36315 DRP000987 C0752248 Transcriptome Analysis Entry Te....
#> 1689323 57176 DRP000987 C3178810 Transcriptome RtM via:....
#> bagOfWords
#> 35822 11
#> 58627 20
#> 149067 2
#> 216848 7
#> 716101 591
#> 761253 6
#> 832289 4
#> 1116527 5
#> 1230230 8
#> 1468242 209
#> 1573542 372
#> 1596889 1163
#> 1627781 66
#> 1689323 576
PCA summary of each study can be accessed through
PCAsummary
method. Output is a list with the length of
studies, where each element is a matrix containing PCA summary results:
standard deviation (SD), variance explained by each PC (Variance), and
the cumulative variance explained (Cumulative).
length(PCAsummary(RAVmodel))
#> [1] 536
PCAsummary(RAVmodel)[1]
#> $DRP000987
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4
#> SD 42.6382804 21.97962707 12.94848367 11.89265640
#> Variance 0.3580557 0.09514629 0.03302091 0.02785537
#> Cumulative 0.3580557 0.45320196 0.48622287 0.51407824
#> DRP000987.PC5 DRP000987.PC6 DRP000987.PC7 DRP000987.PC8
#> SD 10.89460616 10.53655583 9.47278967 8.97268098
#> Variance 0.02337622 0.02186495 0.01767287 0.01585607
#> Cumulative 0.53745446 0.55931942 0.57699228 0.59284836
#> DRP000987.PC9 DRP000987.PC10 DRP000987.PC11 DRP000987.PC12
#> SD 8.43652010 7.90810771 7.60865968 7.076401846
#> Variance 0.01401774 0.01231676 0.01140165 0.009862254
#> Cumulative 0.60686609 0.61918285 0.63058449 0.640446748
#> DRP000987.PC13 DRP000987.PC14 DRP000987.PC15 DRP000987.PC16
#> SD 6.89881271 6.766783788 6.703417100 6.6042884
#> Variance 0.00937346 0.009018116 0.008850009 0.0085902
#> Cumulative 0.64982021 0.658838324 0.667688332 0.6762785
#> DRP000987.PC17 DRP000987.PC18 DRP000987.PC19 DRP000987.PC20
#> SD 6.44257163 6.245257583 6.042236748 5.992461574
#> Variance 0.00817466 0.007681604 0.007190294 0.007072317
#> Cumulative 0.68445319 0.692134796 0.699325091 0.706397408
The workflow to build the RAVmodel is available from https://github.com/shbrief/model_building which is archived in Zenodo with the identifier https://doi.org/10.5281/zenodo.6496552. All analyses presented in the GenomicSuperSignatures manuscript are reproducible using code accessible from https://github.com/shbrief/GenomicSuperSignaturePaper/ and archived in Zenodo with the identifier [https://doi.org/10.5281/zenodo.6496612].
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicSuperSignature_1.15.0 SummarizedExperiment_1.37.0
#> [3] Biobase_2.67.0 GenomicRanges_1.59.1
#> [5] GenomeInfoDb_1.43.2 IRanges_2.41.1
#> [7] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [9] generics_0.1.3 MatrixGenerics_1.19.0
#> [11] matrixStats_1.4.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 blob_1.2.4 dplyr_1.1.4
#> [4] filelock_1.0.3 fastmap_1.2.0 BiocFileCache_2.15.0
#> [7] digest_0.6.37 lifecycle_1.0.4 cluster_2.1.6
#> [10] RSQLite_2.3.8 magrittr_2.0.3 compiler_4.4.2
#> [13] rlang_1.1.4 sass_0.4.9 tools_4.4.2
#> [16] utf8_1.2.4 yaml_2.3.10 knitr_1.49
#> [19] ggsignif_0.6.4 S4Arrays_1.7.1 curl_6.0.1
#> [22] bit_4.5.0 DelayedArray_0.33.2 RColorBrewer_1.1-3
#> [25] abind_1.4-8 withr_3.0.2 purrr_1.0.2
#> [28] sys_3.4.3 grid_4.4.2 fansi_1.0.6
#> [31] ggpubr_0.6.0 colorspace_2.1-1 ggplot2_3.5.1
#> [34] scales_1.3.0 iterators_1.0.14 cli_3.6.3
#> [37] rmarkdown_2.29 crayon_1.5.3 httr_1.4.7
#> [40] rjson_0.2.23 DBI_1.2.3 cachem_1.1.0
#> [43] zlibbioc_1.52.0 parallel_4.4.2 BiocManager_1.30.25
#> [46] XVector_0.47.0 vctrs_0.6.5 Matrix_1.7-1
#> [49] jsonlite_1.8.9 carData_3.0-5 car_3.1-3
#> [52] GetoptLong_1.0.5 bit64_4.5.2 rstatix_0.7.2
#> [55] irlba_2.3.5.1 Formula_1.2-5 clue_0.3-66
#> [58] maketools_1.3.1 foreach_1.5.2 jquerylib_0.1.4
#> [61] tidyr_1.3.1 glue_1.8.0 codetools_0.2-20
#> [64] shape_1.4.6.1 gtable_0.3.6 UCSC.utils_1.3.0
#> [67] ComplexHeatmap_2.23.0 munsell_0.5.1 tibble_3.2.1
#> [70] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
#> [73] circlize_0.4.16 dbplyr_2.5.0 R6_2.5.1
#> [76] doParallel_1.0.17 evaluate_1.0.1 lattice_0.22-6
#> [79] png_0.1-8 backports_1.5.0 memoise_2.0.1
#> [82] broom_1.0.7 bslib_0.8.0 SparseArray_1.7.2
#> [85] xfun_0.49 buildtools_1.0.0 pkgconfig_2.0.3
#> [88] GlobalOptions_0.1.2