You can download GenomicSuperSignature from Google Cloud bucket using
GenomicSuperSignature::getModel
function. Currently
available models are built from top 20 PCs of 536 studies (containing
44,890 samples) containing 13,934 common genes from each of 536 study’s
top 90% varying genes based on their study-level standard deviation.
There are two versions of this RAVmodel annotated with different gene
sets for GSEA - MSigDB C2 (C2
) and three priors from PLIER
package (PLIERpriors
).
The demo in this vignette is based on human B-cell expression data,
so we are using the PLIERpriors
model annotated with
blood-associated gene sets.
Note that the first interactive run of this code, you will be asked
to allow R to create a cache directory. The model file will be stored
there and subsequent calls to getModel
will read from the
cache.
RAVmodel <- getModel("PLIERpriors", load=TRUE)
#> [1] "downloading"
RAVmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(8): cluster size ... version geneSets
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
version(RAVmodel)
#> [1] "1.1.1"
The human B-cell dataset (Gene Expression Omnibus series GSE2350) consists of 211 normal and tumor human B-cell phenotypes. This dataset was generated on Affymatrix HG-U95Av2 arrays and stored in an ExpressionSet object with 6,249 features x 211 samples.
data(bcellViper)
dset
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 6249 features, 211 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: GSM44075 GSM44078 ... GSM44302 (211 total)
#> varLabels: sampleID description detailed_description
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:
You can provide your own expression dataset in any of these formats: simple matrix, ExpressionSet, or SummarizedExperiment. Just make sure that genes are in a ‘symbol’ format.
validate
function calculates validation score, which
provides a quantitative representation of the relevance between a new
dataset and RAV. RAVs that give the validation score is called
validated RAV. The validation results can be displayed
in different ways for more intuitive interpretation.
val_all <- validate(dset, RAVmodel)
head(val_all)
#> score PC sw cl_size cl_num
#> RAV1 0.2249382 2 -0.05470163 6 1
#> RAV2 0.3899341 2 0.06426256 21 2
#> RAV3 0.1887409 2 -0.01800335 4 3
#> RAV4 0.2309703 6 -0.04005584 7 4
#> RAV5 0.2017431 2 0.05786189 3 5
#> RAV6 0.1903600 8 -0.02520973 3 6
heatmapTable
takes validation results as its input and
displays them into a two panel table: the top panel shows the average
silhouette width (avg.sw) and the bottom panel displays the validation
score.
heatmapTable
can display different subsets of the
validation output. For example, if you specify scoreCutoff
,
any validation result above that score will be shown. If you specify the
number (n) of top validation results through num.out
, the
output will be a n-columned heatmap table. You can also use the average
silhouette width (swCutoff
), the size of cluster
(clsizecutoff
), one of the top 8 PCs from the dataset
(whichPC
).
Here, we print out top 5 validated RAVs with average silhouette width above 0.
Under the default condition, plotValidate
plots
validation results of all non single-element RAVs in one graph, where
x-axis represents average silhouette width of the RAVs (a quality
control measure of RAVs) and y-axis represents validation score. We
recommend users to focus on RAVs with higher validation score and use
average silhouette width as a secondary criteria.
Note that interactive = TRUE
will result in a zoomable,
interactive plot that included tooltips.
You can hover each data point for more information:
If you double-click the PC legend on the right, you will enter an individual display mode where you can add an additional group of data point by single-click.
GenomicSuperSignature connects different public databases and prior information through RAVmodel, creating the knowledge graph illustrated below. Through RAVs, you can access and explore the knowledge graph from multiple entry points such as gene expression profiles, publications, study metadata, keywords in MeSH terms and gene sets.
You can draw a wordcloud with the enriched MeSH terms of RAVs. Based
on the heatmap table above, three RAVs (2538, 1139, and 884) show the
high validation scores with the positive average silhouette widths, so
we draw wordclouds of those RAVs using drawWordcloud
function. You need to provide RAVmodel and the index of the RAV you are
interested in.
Index of validated RAVs can be easily collected using
validatedSingatures
function, which outputs the validated
index based on num.out
, PC from dataset
(whichPC
) or any *Cutoff
arguments in the same
way as heatmapTable
. Here, we choose the top 3 RAVs with
the average silhouette width above 0, which will returns RAV2538,
RAV1139, and RAV884 as we discussed above.
validated_ind <- validatedSignatures(val_all, RAVmodel, num.out = 3,
swCutoff = 0, indexOnly = TRUE)
validated_ind
#> [1] 2538 1139 884
And we plot the wordcloud of those three RAVs.
You can directly access the GSEA outputs for each RAV using
annotateRAV
function. Based on the wordclouds, RAV1139
seems to be associated with B-cell.
annotateRAV(RAVmodel, validated_ind[2]) # RAV1139
#> Description NES pvalue
#> 1 DMAP_ERY3 -2.179082 1e-10
#> 2 KEGG_ALZHEIMERS_DISEASE -2.443701 1e-10
#> 3 REACTOME_POST_TRANSLATIONAL_PROTEIN_MODIFICATION -2.458995 1e-10
#> 4 REACTOME_APOPTOSIS -2.645437 1e-10
#> 5 KEGG_HUNTINGTONS_DISEASE -2.670108 1e-10
#> qvalues
#> 1 6.390977e-10
#> 2 6.390977e-10
#> 3 6.390977e-10
#> 4 6.390977e-10
#> 5 6.390977e-10
If you want to check the enriched pathways for multiple RAVs, use
subsetEnrichedPathways
function instead.
subsetEnrichedPathways(RAVmodel, validated_ind[2], include_nes = TRUE)
#> DataFrame with 10 rows and 2 columns
#> RAV1139.Description RAV1139.NES
#> <character> <numeric>
#> Up_1 DMAP_ERY3 -2.17908
#> Up_2 KEGG_ALZHEIMERS_DISE.. -2.44370
#> Up_3 REACTOME_POST_TRANSL.. -2.45899
#> Up_4 REACTOME_APOPTOSIS -2.64544
#> Up_5 KEGG_HUNTINGTONS_DIS.. -2.67011
#> Up_6 REACTOME_MRNA_PROCES.. -2.67431
#> Up_7 REACTOME_HOST_INTERA.. -2.69593
#> Up_8 PID_E2F_PATHWAY -2.74887
#> Up_9 KEGG_PYRIMIDINE_META.. -2.75404
#> Up_10 REACTOME_CDK_MEDIATE.. -2.75535
subsetEnrichedPathways(RAVmodel, validated_ind, include_nes = TRUE)
#> DataFrame with 10 rows and 6 columns
#> RAV2538.Description RAV2538.NES RAV1139.Description RAV1139.NES
#> <character> <numeric> <character> <numeric>
#> Up_1 REACTOME_CELL_CYCLE 3.25799 DMAP_ERY3 -2.17908
#> Up_2 REACTOME_CELL_CYCLE_.. 3.22323 KEGG_ALZHEIMERS_DISE.. -2.44370
#> Up_3 REACTOME_DNA_REPLICA.. 3.18974 REACTOME_POST_TRANSL.. -2.45899
#> Up_4 REACTOME_MITOTIC_M_M.. 3.10711 REACTOME_APOPTOSIS -2.64544
#> Up_5 REACTOME_MITOTIC_PRO.. 2.91384 KEGG_HUNTINGTONS_DIS.. -2.67011
#> Up_6 KEGG_CELL_CYCLE 2.86725 REACTOME_MRNA_PROCES.. -2.67431
#> Up_7 REACTOME_CHROMOSOME_.. 2.85931 REACTOME_HOST_INTERA.. -2.69593
#> Up_8 REACTOME_S_PHASE 2.85345 PID_E2F_PATHWAY -2.74887
#> Up_9 REACTOME_MITOTIC_G1_.. 2.83497 KEGG_PYRIMIDINE_META.. -2.75404
#> Up_10 REACTOME_CELL_CYCLE_.. 2.80616 REACTOME_CDK_MEDIATE.. -2.75535
#> RAV884.Description RAV884.NES
#> <character> <numeric>
#> Up_1 DMAP_ERY3 -1.43162
#> Up_2 REACTOME_METABOLISM_.. -1.54368
#> Up_3 KEGG_ALZHEIMERS_DISE.. -1.62716
#> Up_4 KEGG_HUNTINGTONS_DIS.. -1.65062
#> Up_5 KEGG_CELL_CYCLE -1.65473
#> Up_6 REACTOME_MITOTIC_G1_.. -1.67482
#> Up_7 REACTOME_METABOLISM_.. -1.68713
#> Up_8 REACTOME_CELL_CYCLE -1.69922
#> Up_9 REACTOME_CELL_CYCLE_.. -1.72169
#> Up_10 REACTOME_DNA_REPLICA.. -1.76756
You can also find the RAVs annotated with the keyword-containing
pathways using findSignature
function. Without the
k
argument, this function outputs a data frame with two
columns: the number of RAVs (Freq
column) with the
different numbers of keyword-containing, enriched pathways
(# of keyword-containing pathways
column).
Here, we used the keyword, “Bcell”.
findSignature(RAVmodel, "Bcell")
#> # of keyword-containing pathways Freq
#> 1 0 4678
#> 2 1 46
#> 3 2 15
#> 4 3 16
#> 5 4 6
#> 6 5 3
There are two RAVs with five keyword-containing pathways (row 6). We can check which RAVs they are.
Enriched pathways are ordered by NES and you can check the rank of
any keyword- containing pathways using
findKeywordInRAV
.
You can check all enriched pathways of RAV using
subsetEnrichedPathways
function. If both=TRUE
,
both the top and bottom enriched pathways will be printed.
## Chosen based on validation/MeSH terms
subsetEnrichedPathways(RAVmodel, ind = validated_ind[2], n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV1139.Description
#> <character>
#> Up_1 DMAP_ERY3
#> Up_2 KEGG_ALZHEIMERS_DISE..
#> Up_3 REACTOME_POST_TRANSL..
#> Down_1 REACTOME_CELL_CYCLE_..
#> Down_2 REACTOME_DNA_REPLICA..
#> Down_3 REACTOME_CELL_CYCLE
## Chosen based on enriched pathways
subsetEnrichedPathways(RAVmodel, ind = 695, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV695.Description
#> <character>
#> Up_1 IRIS_Bcell-Memory_Ig..
#> Up_2 DMAP_BCELLA3
#> Up_3 IRIS_Bcell-naive
#> Down_1 SVM B cells memory
#> Down_2 DMAP_BCELLA1
#> Down_3 IRIS_PlasmaCell-From..
subsetEnrichedPathways(RAVmodel, ind = 953, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV953.Description
#> <character>
#> Up_1 IRIS_Bcell-Memory_Ig..
#> Up_2 IRIS_Bcell-Memory_IgM
#> Up_3 IRIS_Bcell-naive
#> Down_1 IRIS_Monocyte-Day0
#> Down_2 IRIS_DendriticCell-L..
#> Down_3 IRIS_Monocyte-Day1
subsetEnrichedPathways(RAVmodel, ind = 1994, n = 3, both = TRUE)
#> DataFrame with 6 rows and 1 column
#> RAV1994.Description
#> <character>
#> Up_1 SVM T cells CD4 memo..
#> Up_2 DMAP_TCELLA6
#> Up_3 DMAP_TCELLA8
#> Down_1 IRIS_Bcell-naive
#> Down_2 IRIS_Bcell-Memory_IgM
#> Down_3 IRIS_Bcell-Memory_Ig..
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] bcellViper_1.41.0 GenomicSuperSignature_1.15.0
#> [3] SummarizedExperiment_1.35.5 Biobase_2.67.0
#> [5] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
#> [7] IRanges_2.39.2 S4Vectors_0.43.2
#> [9] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
#> [11] matrixStats_1.4.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 rlang_1.1.4 magrittr_2.0.3
#> [4] clue_0.3-65 GetoptLong_1.0.5 compiler_4.4.1
#> [7] RSQLite_2.3.7 png_0.1-8 vctrs_0.6.5
#> [10] pkgconfig_2.0.3 shape_1.4.6.1 crayon_1.5.3
#> [13] fastmap_1.2.0 backports_1.5.0 dbplyr_2.5.0
#> [16] XVector_0.45.0 labeling_0.4.3 utf8_1.2.4
#> [19] rmarkdown_2.28 tzdb_0.4.0 UCSC.utils_1.1.0
#> [22] purrr_1.0.2 bit_4.5.0 xfun_0.48
#> [25] zlibbioc_1.51.2 cachem_1.1.0 jsonlite_1.8.9
#> [28] blob_1.2.4 highr_0.11 DelayedArray_0.31.14
#> [31] broom_1.0.7 irlba_2.3.5.1 parallel_4.4.1
#> [34] cluster_2.1.6 R6_2.5.1 bslib_0.8.0
#> [37] RColorBrewer_1.1-3 car_3.1-3 jquerylib_0.1.4
#> [40] Rcpp_1.0.13 iterators_1.0.14 knitr_1.48
#> [43] readr_2.1.5 Matrix_1.7-1 tidyselect_1.2.1
#> [46] abind_1.4-8 yaml_2.3.10 doParallel_1.0.17
#> [49] codetools_0.2-20 curl_5.2.3 lattice_0.22-6
#> [52] tibble_3.2.1 withr_3.0.2 evaluate_1.0.1
#> [55] BiocFileCache_2.15.0 circlize_0.4.16 pillar_1.9.0
#> [58] BiocManager_1.30.25 ggpubr_0.6.0 filelock_1.0.3
#> [61] carData_3.0-5 foreach_1.5.2 generics_0.1.3
#> [64] hms_1.1.3 ggplot2_3.5.1 munsell_0.5.1
#> [67] scales_1.3.0 glue_1.8.0 maketools_1.3.1
#> [70] tools_4.4.1 sys_3.4.3 ggsignif_0.6.4
#> [73] buildtools_1.0.0 grid_4.4.1 tidyr_1.3.1
#> [76] colorspace_2.1-1 GenomeInfoDbData_1.2.13 Formula_1.2-5
#> [79] cli_3.6.3 fansi_1.0.6 S4Arrays_1.5.11
#> [82] ComplexHeatmap_2.23.0 dplyr_1.1.4 gtable_0.3.6
#> [85] rstatix_0.7.2 sass_0.4.9 digest_0.6.37
#> [88] wordcloud_2.6 SparseArray_1.5.45 farver_2.1.2
#> [91] rjson_0.2.23 memoise_2.0.1 htmltools_0.5.8.1
#> [94] lifecycle_1.0.4 httr_1.4.7 GlobalOptions_0.1.2
#> [97] bit64_4.5.2