Up and running with pcaExplorer
First things first: install pcaExplorer and load it into your R session. You should receive a message notification if this is completed without errors.
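The installation commands did not survive in this copy of the document. A minimal sketch, assuming the package is installed from Bioconductor through the BiocManager installer, could look like this:

```r
# Install the BiocManager helper from CRAN if it is missing
# (assumption: Bioconductor packages are installed through BiocManager).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install pcaExplorer only if it is not already available.
if (!requireNamespace("pcaExplorer", quietly = TRUE)) {
  BiocManager::install("pcaExplorer")
}

# Load the package into the current R session.
library("pcaExplorer")
```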
This document describes a use case for pcaExplorer, based on the dataset in the airway package. If this package is not available on your machine, please install it by executing:
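The command itself is missing here. Assuming BiocManager is the installer in use, a sketch of the installation step would be:

```r
# Make sure the BiocManager helper is present (assumed installer).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the airway data package only if it is not yet available.
if (!requireNamespace("airway", quietly = TRUE)) {
  BiocManager::install("airway")
}
```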
This dataset consists of the gene-level expression measurements (as raw read counts) for an experiment where four different human airway smooth muscle cell lines are either treated with dexamethasone or left untreated.
To start the exploration, you just need the following lines:
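The lines referred to above were not preserved. A sketch of a launch without any preloaded data, assuming pcaExplorer is installed, is:

```r
library("pcaExplorer")

# Launch the Shiny app without preloaded data. The call blocks until the
# app is closed, so it is only started in an interactive session.
if (interactive()) {
  pcaExplorer()
}
```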
The easiest way to explore the airway dataset is by clicking on the dedicated button in the Data Upload panel. This action will create the dds object, normalize the expression values (using the robust method proposed by Anders and Huber in the original DESeq manuscript), and compute the variance stabilizing transformed expression values (stored in the dst object).
If you want to load your own expression data, please refer to the User Guide, which contains detailed information on the formats your data must respect.
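The preprocessing triggered by the demo button can be approximated in a script. The sketch below uses DESeq2 directly and is not necessarily identical to what the app runs internally; in particular, the design formula is an assumption made for illustration.

```r
# Run only if the required Bioconductor packages are installed.
if (requireNamespace("DESeq2", quietly = TRUE) &&
    requireNamespace("airway", quietly = TRUE)) {
  library("DESeq2")
  library("airway")

  # Load the airway SummarizedExperiment object.
  data("airway", package = "airway")

  # Build the dds object (design formula assumed for illustration).
  dds <- DESeqDataSet(airway, design = ~ cell + dex)

  # Median-of-ratios normalization, as proposed by Anders and Huber.
  dds <- estimateSizeFactors(dds)

  # Variance stabilizing transformation, giving the dst object.
  dst <- vst(dds)

  # Optionally hand both objects to the app in an interactive session.
  if (interactive() && requireNamespace("pcaExplorer", quietly = TRUE)) {
    pcaExplorer::pcaExplorer(dds = dds, dst = dst)
  }
}
```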
Once the preprocessing of the input is done, you should get a notification in the lower right corner that you’re all set. The whole preprocessing should take around 5-6 seconds (tested on a MacBook Pro with an i7 processor and 16 GB of RAM). You can check what each component looks like by clicking on its respective button, once the buttons have appeared in the lower half of the panel.
You can proceed to explore the expression values of your dataset in the Counts Table tab. You can change the data type you are displaying between raw counts, normalized, or transformed, and plot their values in a scatterplot matrix to explore their sample-to-sample correlations. To try this, select for example “Normalized counts”, change the correlation coefficient to “spearman”, and click on the Run action button. The correlation values will also be displayed as a heatmap.
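The sample-to-sample correlation shown in this tab can be sketched in base R. The example uses simulated counts and a simple library-size normalization as a stand-in for the DESeq2 size factors; all object names are hypothetical.

```r
set.seed(42)

# Simulated raw counts: 200 genes x 4 samples (hypothetical data).
counts <- matrix(
  rnbinom(800, mu = 100, size = 2),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:4))
)

# Library-size normalization (a simple stand-in for DESeq2 size factors).
size_factors <- colSums(counts) / mean(colSums(counts))
norm_counts <- sweep(counts, 2, size_factors, "/")

# Sample-to-sample Spearman correlation matrix.
cor_spearman <- cor(norm_counts, method = "spearman")
print(round(cor_spearman, 2))
```

The resulting matrix is what a correlation heatmap would display, for example via `heatmap(cor_spearman, symm = TRUE)`.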
Additional information, both for samples and for features, is displayed in the Data overview panel. A closer look at the metadata of the airway set highlights how each combination of cell type (cell) and dexamethasone treatment (dex) is represented by a single sequencing experiment. The 8 samples in the demo dataset are themselves a subsample of the full GEO record, namely those not treated with albuterol (alb column).
The relationship among samples can be seen in the sample-to-sample heatmap. For example, by selecting the Manhattan distance metric, it is evident how the samples cluster by dex treatment, yet they show a dendrogram structure that recalls the 4 different cell types used. The total sum of counts per sample is displayed as a bar plot.
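A sample-to-sample distance analysis of this kind can be sketched in base R. The data below are simulated (two untreated and two treated samples with a treatment effect on 20 genes), so the numbers are purely illustrative.

```r
set.seed(1)

# Hypothetical transformed expression values for 100 genes.
n_genes <- 100
base_expr <- rnorm(n_genes, mean = 8)
effect <- c(rep(3, 20), rep(0, 80))  # 20 genes respond to treatment

expr <- cbind(
  untrt_1 = base_expr + rnorm(n_genes, sd = 0.2),
  untrt_2 = base_expr + rnorm(n_genes, sd = 0.2),
  trt_1   = base_expr + effect + rnorm(n_genes, sd = 0.2),
  trt_2   = base_expr + effect + rnorm(n_genes, sd = 0.2)
)

# dist() computes distances between rows, so samples must be in rows.
d_manhattan <- dist(t(expr), method = "manhattan")

# Hierarchical clustering gives the dendrogram drawn next to the heatmap.
hc <- hclust(d_manhattan)

# Cutting the tree into two groups should recover the treatment split.
clusters <- cutree(hc, k = 2)
print(clusters)
```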
Patterns can become clearer after selecting, in the App settings on the left, an experimental factor to group and color by: try selecting dex, for example. If more than one covariate is selected, the interaction between these will be taken as a grouping factor. To remove one, simply click on it to highlight and press the del or backspace key to delete it. Try doing so by also clicking on cell, and then removing dex afterwards.
Basic summary information is also displayed for the genes. In the count matrix provided, one can check how many genes were detected, by selecting a “Threshold on the row sums of the counts” or on the row means of the normalized counts (more stringent). For example, selecting 5 in both cases, only 24345 genes have a total number of counts, summed by row, above this threshold, and 17745 genes have more than 5 counts (normalized) on average.
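The two detection filters can be sketched in base R as follows. The counts are simulated, the normalization is a simple library-size scaling, and the use of a strict "greater than" comparison is an assumption about how the threshold is applied.

```r
set.seed(7)

# Simulated counts: 1000 genes x 8 samples. The first 500 genes are
# barely expressed, the remaining 500 are clearly expressed.
counts <- matrix(
  rnbinom(1000 * 8, mu = rep(c(0.2, 50), each = 500), size = 1),
  nrow = 1000
)

# Library-size normalization (stand-in for DESeq2 size factors).
size_factors <- colSums(counts) / mean(colSums(counts))
norm_counts <- sweep(counts, 2, size_factors, "/")

threshold <- 5

# Genes whose raw counts, summed across samples, exceed the threshold.
n_rowsum <- sum(rowSums(counts) > threshold)

# Genes whose mean normalized count exceeds the threshold (more stringent).
n_rowmean <- sum(rowMeans(norm_counts) > threshold)

cat("Row sum filter: ", n_rowsum, "genes\n")
cat("Row mean filter:", n_rowmean, "genes\n")
```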
The Samples View and the Genes View are the tabs where most results coming from Principal Component Analysis, either performed on the samples or on the genes, can be explored in depth. Assuming you selected cell in the “Group/color by” option on the left, the Samples PCA plot should clearly display how the cell type explains a considerable portion of the variability in the dataset (corresponding to the second PC). To check that dex treatment is the main source of variability, select that instead of cell.
The scree plot on the right shows how many components should be retained for a satisfactory reduced dimension view of the original set, with their eigenvalues from largest to smallest. To explore the PCs other than the first and the second one, you can just select them in the x-axis PC and y-axis PC widgets in the left sidebar.
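The samples PCA and the quantities behind a scree plot can be sketched with prcomp from base R. The data are simulated, with a treatment effect placed on 100 genes, and the selection of the top variable genes mirrors the idea described in this document rather than the app's exact code.

```r
set.seed(123)

# Hypothetical transformed expression values: 500 genes x 8 samples.
n_genes <- 500
expr <- matrix(
  rnorm(n_genes * 8, mean = 8),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("s", 1:8))
)
dex <- rep(c("untrt", "trt"), times = 4)

# Add a treatment effect to the first 100 genes.
expr[1:100, dex == "trt"] <- expr[1:100, dex == "trt"] + 3

# Keep the most variable genes before running the PCA.
ntop <- 300
row_vars <- apply(expr, 1, var)
top_genes <- order(row_vars, decreasing = TRUE)[seq_len(ntop)]

# PCA on the samples: samples in rows, genes in columns.
pca <- prcomp(t(expr[top_genes, ]), center = TRUE, scale. = FALSE)

# Proportion of variance per component, as shown in a scree plot.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first two components, to be colored by a grouping factor.
scores <- data.frame(pca$x[, 1:2], dex = dex)
print(scores)
```

Changing `ntop` and re-running the PCA is a quick way to check how sensitive the result is to the number of top variable genes.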
If you brush (left-click and hold) on the PCA plot, you can display a zoomed version of it in the frame below. If you suspect some samples might be outliers (although this is not the case in the airway set), you can select them in the dedicated plot, and get a first impression of what the remainder of the samples would look like. On the right side, you can quickly check which genes show the top and bottom loadings, split by principal component. First, change the value in the input widget to 20; then, select one gene from each list and try to check them in the Gene Finder tab; try for example with DUSP1, PER1, and DDX3Y.
While DUSP1 and PER1 clearly show a change in expression upon dexamethasone treatment (and indeed were reported among the well known glucocorticoid-responsive genes in the original publication of Himes et al., 2014), DDX3Y displays variability at the cell type level (select cell in the Group/color by widget): this gene is almost undetected in N061011 cells, and this high variance is what determines its high loading on the second principal component.
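The link between a gene's variance and its loading can be sketched in base R. In this simulated example, 20 genes differ strongly between two hypothetical groups, and these are the genes that end up with the top and bottom loadings on the first component.

```r
set.seed(2024)

# Hypothetical expression values: 200 genes x 6 samples, low noise.
n_genes <- 200
expr <- matrix(
  rnorm(n_genes * 6, mean = 8, sd = 0.3),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("s", 1:6))
)
group <- rep(c("A", "B"), each = 3)

# Genes 1-10 are higher in group B, genes 11-20 are lower in group B.
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 4
expr[11:20, group == "B"] <- expr[11:20, group == "B"] - 4

# PCA on the samples; the rotation matrix holds the gene loadings.
pca <- prcomp(t(expr))
loadings_pc1 <- sort(pca$rotation[, "PC1"], decreasing = TRUE)

# Genes with the most positive and most negative loadings on PC1.
n_show <- 5
top_genes <- names(head(loadings_pc1, n_show))
bottom_genes <- names(tail(loadings_pc1, n_show))

print(top_genes)
print(bottom_genes)
```

The sign of a principal component is arbitrary, so which block of genes appears at the top and which at the bottom can flip between runs or implementations.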
You can see the single expression values in a table as well, and this information can be downloaded with a simple click.
Back to the Samples View, you can experiment with the number of top variable genes to see how the results of PCA are in this case robust to a wide range of this value - this might not be the case with other datasets, and the simplicity of interacting with these parameters makes it easy to iterate in the exploration steps.
Proceeding to the Genes View, you can see the dual of the Samples PCA: now the samples are displayed as arrows in the genes biplot, which can show which genes display a similar behaviour. You can capture this with a simple brushing action on the plot, and notice how their profiles throughout all samples are shown in the Profile explorer below; moreover, a static and an interactive heatmap, together with a table containing the underlying data, are generated in the rows below.
Since we compute the gene annotation table as well, it’s nice to read the gene symbols in the zoomed window (instead of the ENSEMBL ids). By clicking close enough to any of these genes, the expression values are plotted, in a similar fashion as in the Gene Finder.
The PCA2GO tab helps you understand the common biological themes (default: the Gene Ontology Biological Process terms) in the genes showing up in the top and in the bottom loadings for each principal component. Since we launched the pcaExplorer app without additional parameters, this information is not available, but can be computed live (this might take a while).
Still, a previous call to pca2go is recommended, as it relies on the algorithm of the topGO package: it will require some additional computing time, but it is likely to deliver more precise terms (i.e. more relevant from a biological point of view). To do so, you should exit the live session, compute this object, and provide it in the call to pcaExplorer (see the main user guide for details on how to do so).
A typical session with pcaExplorer includes one or more iterations on each of these tabs. Once you are finished, you might want to store the results of your analysis in different formats.
With pcaExplorer you can do all of the following:
- save the state to an .RData file, as if it were a workspace (by clicking on the cog icon on the right side of the task menu)
- use “Exit pcaExplorer and save”, which also saves the state, but in a specific environment of your R session that you can later access by its name, which would normally look like pcaExplorerState_YYYYMMDD_HHMMSS (also accessible from the cog)
- use the template analysis that pcaExplorer comes with, which picks up the latest status of the app during your session and combines these reactive values in an R Markdown document that you can first preview live in the app and then download as a standalone HTML file, to store or share. This document stitches together narrative text, code, and output objects, and constitutes a compendium where all actions are recorded. If you are familiar with R, you can edit it live, with support for autocompletion, in the “Edit report” tab.
The functionality to display the report preview is based on knit2html, and some elements such as DataTable objects might not render correctly. To render them correctly, please install the PhantomJS executable before launching the app. This can be done by using the webshot package and calling webshot::install_phantomjs(); HTML widgets will then be rendered automatically as screenshots. Alternatively, the more recent webshot2 package uses the headless Chrome browser (via the chromote package, requiring Google Chrome or another Chromium-based browser). Keep in mind that the fully rendered report (the one you can obtain with the “Generate & Save” button) is not affected by this, since it uses rmarkdown::render().
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] org.Hs.eg.db_3.20.0 AnnotationDbi_1.69.0
## [3] DESeq2_1.47.1 airway_1.26.0
## [5] SummarizedExperiment_1.37.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [9] S4Vectors_0.45.2 MatrixGenerics_1.19.1
## [11] matrixStats_1.5.0 pcaExplorer_3.1.1
## [13] Biobase_2.67.0 BiocGenerics_0.53.3
## [15] generics_0.1.3 knitr_1.49
## [17] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.5 bitops_1.0-9 enrichplot_1.27.4
## [4] fontawesome_0.5.3 httr_1.4.7 webshot_0.5.5
## [7] RColorBrewer_1.1-3 doParallel_1.0.17 Rgraphviz_2.51.0
## [10] tools_4.4.2 R6_2.5.1 DT_0.33
## [13] lazyeval_0.2.2 mgcv_1.9-1 withr_3.0.2
## [16] prettyunits_1.2.0 gridExtra_2.3 cli_3.6.3
## [19] TSP_1.2-4 labeling_0.4.3 sass_0.4.9
## [22] topGO_2.59.0 genefilter_1.89.0 goseq_1.59.0
## [25] Rsamtools_2.23.1 yulab.utils_0.1.9 gson_0.1.0
## [28] txdbmaker_1.3.1 DOSE_4.1.0 R.utils_2.12.3
## [31] AnnotationForge_1.49.0 limma_3.63.3 RSQLite_2.3.9
## [34] GOstats_2.73.0 gridGraphics_0.5-1 BiocIO_1.17.1
## [37] crosstalk_1.2.1 dplyr_1.1.4 dendextend_1.19.0
## [40] GO.db_3.20.0 Matrix_1.7-1 abind_1.4-8
## [43] R.methodsS3_1.8.2 lifecycle_1.0.4 yaml_2.3.10
## [46] qvalue_2.39.0 SparseArray_1.7.4 BiocFileCache_2.15.1
## [49] grid_4.4.2 blob_1.2.4 promises_1.3.2
## [52] crayon_1.5.3 shinydashboard_0.7.2 ggtangle_0.0.6
## [55] lattice_0.22-6 cowplot_1.1.3 GenomicFeatures_1.59.1
## [58] annotate_1.85.0 KEGGREST_1.47.0 sys_3.4.3
## [61] maketools_1.3.1 pillar_1.10.1 fgsea_1.33.2
## [64] rjson_0.2.23 codetools_0.2-20 fastmatch_1.1-6
## [67] glue_1.8.0 ggfun_0.1.8 data.table_1.16.4
## [70] vctrs_0.6.5 png_0.1-8 treeio_1.31.0
## [73] gtable_0.3.6 assertthat_0.2.1 cachem_1.1.0
## [76] xfun_0.50 S4Arrays_1.7.1 mime_0.12
## [79] survival_3.8-3 pheatmap_1.0.12 seriation_1.5.7
## [82] iterators_1.0.14 statmod_1.5.0 Category_2.73.0
## [85] nlme_3.1-166 ggtree_3.15.0 bit64_4.6.0-1
## [88] threejs_0.3.3 progress_1.2.3 filelock_1.0.3
## [91] bslib_0.8.0 colorspace_2.1-1 DBI_1.2.3
## [94] tidyselect_1.2.1 bit_4.5.0.1 compiler_4.4.2
## [97] curl_6.1.0 httr2_1.1.0 graph_1.85.1
## [100] BiasedUrn_2.0.12 SparseM_1.84-2 xml2_1.3.6
## [103] DelayedArray_0.33.4 plotly_4.10.4 rtracklayer_1.67.0
## [106] scales_1.3.0 mosdef_1.3.1 RBGL_1.83.0
## [109] NMF_0.28 rappdirs_0.3.3 stringr_1.5.1
## [112] digest_0.6.37 shinyBS_0.61.1 rmarkdown_2.29
## [115] ca_0.71.1 XVector_0.47.2 htmltools_0.5.8.1
## [118] pkgconfig_2.0.3 base64enc_0.1-3 dbplyr_2.5.0
## [121] fastmap_1.2.0 rlang_1.1.5 htmlwidgets_1.6.4
## [124] UCSC.utils_1.3.1 shiny_1.10.0 farver_2.1.2
## [127] jquerylib_0.1.4 jsonlite_1.8.9 BiocParallel_1.41.0
## [130] GOSemSim_2.33.0 R.oo_1.27.0 RCurl_1.98-1.16
## [133] magrittr_2.0.3 GenomeInfoDbData_1.2.13 ggplotify_0.1.2
## [136] patchwork_1.3.0 munsell_0.5.1 Rcpp_1.0.14
## [139] ape_5.8-1 viridis_0.6.5 stringi_1.8.4
## [142] MASS_7.3-64 plyr_1.8.9 parallel_4.4.2
## [145] ggrepel_0.9.6 Biostrings_2.75.3 splines_4.4.2
## [148] hms_1.1.3 geneLenDataBase_1.42.0 locfit_1.5-9.10
## [151] igraph_2.1.3 rngtools_1.5.2 buildtools_1.0.0
## [154] reshape2_1.4.4 biomaRt_2.63.0 XML_3.99-0.18
## [157] evaluate_1.0.3 BiocManager_1.30.25 foreach_1.5.2
## [160] tweenr_2.0.3 httpuv_1.6.15 tidyr_1.3.1
## [163] purrr_1.0.2 polyclip_1.10-7 heatmaply_1.5.0
## [166] ggplot2_3.5.1 gridBase_0.4-7 ggforce_0.4.2
## [169] xtable_1.8-4 restfulr_0.0.15 tidytree_0.4.6
## [172] later_1.4.1 viridisLite_0.4.2 tibble_3.2.1
## [175] clusterProfiler_4.15.1 aplot_0.2.4 memoise_2.0.1
## [178] registry_0.5-1 GenomicAlignments_1.43.0 cluster_2.1.8
## [181] GSEABase_1.69.0 shinyAce_0.4.3