This document explains the functionalities available in the esetVis package.
This package contains wrapper functions for three types of visualization: spectral map(P.J. 1976), tsne(van der Maaten 2008) and linear discriminant analysis(Fisher 1936) for data contained in a expressionSet Bioconductor (or an SummarizedExperiment) object. The visualizations are available in two types: static, via the ggplot2 package or interactive, via the ggvis or plotly packages.
The ALL dataset contains microarray results from 128
patients with acute lymphoblastic leukemia (ALL). The data is contained
in a Bioconductor ExpressionSet
object. Extra gene
annotation is added to the object, via the annotation package
hgu95av2.db
.
library(ALL)
data(ALL)
# to get gene annotation from probe IDs (from the paper HGU95aV2 gene chip was used)
library("hgu95av2.db")
library("AnnotationDbi")
probeIDs <- featureNames(ALL)
geneInfo <- AnnotationDbi::select(hgu95av2.db, probeIDs,
c("ENTREZID", "SYMBOL", "GENENAME"), "PROBEID")
# 482 on the 12625 probe IDs don't have ENTREZ ID/SYMBOL/GENENAME
# remove genes with duplicated annotation: 1214
geneInfoWthtDuplicates <- geneInfo[!duplicated(geneInfo$PROBEID), ]
# remove genes without annotation: 482
genesWthtAnnotation <- rowSums(is.na(geneInfoWthtDuplicates)) > 0
geneInfoWthtDuplicatesAndWithAnnotation <- geneInfoWthtDuplicates[!genesWthtAnnotation, ]
probeIDsWithAnnotation <- featureNames(ALL)[featureNames(ALL) %in%
geneInfoWthtDuplicatesAndWithAnnotation$PROBEID]
ALL <- ALL[probeIDsWithAnnotation, ]
fData <- geneInfoWthtDuplicatesAndWithAnnotation[
match(probeIDsWithAnnotation, geneInfoWthtDuplicatesAndWithAnnotation$PROBEID), ]
rownames(fData) <- probeIDsWithAnnotation
fData(ALL) <- fData
# grouping variable: B = B-cell, T = T-cell
groupingVariable <- pData(ALL)$BT
# create custom palette
colorPalette <- c("dodgerblue",
colorRampPalette(c("white","dodgerblue2", "darkblue"))(5)[-1],
"red", colorRampPalette(c("white", "red3", "darkred"))(5)[-1])
names(colorPalette) <- levels(groupingVariable)
color <- groupingVariable; levels(color) <- colorPalette
# reformat type of the remission
remissionType <- ifelse(is.na(ALL$remission), "unknown", as.character(ALL$remission))
ALL$remissionType <- factor(remissionType,
levels = c("unknown", "CR", "REF"),
labels = c("unknown", "achieved", "refractory"))
Following tables detail some sample and gene annotation contained in
the ALL ExpressionSet
used in the vignette.
cod | sex | age | BT | remissionType | |
---|---|---|---|---|---|
01005 | 1005 | M | 53 | B2 | achieved |
01010 | 1010 | M | 19 | B2 | achieved |
03002 | 3002 | F | 52 | B4 | achieved |
04006 | 4006 | M | 38 | B1 | achieved |
04007 | 4007 | M | 57 | B2 | achieved |
04008 | 4008 | M | 17 | B1 | achieved |
PROBEID | ENTREZID | SYMBOL | GENENAME | |
---|---|---|---|---|
1000_at | 1000_at | 5595 | MAPK3 | mitogen-activated protein kinase 3 |
1001_at | 1001_at | 7075 | TIE1 | tyrosine kinase with immunoglobulin like and EGF like domains 1 |
1002_f_at | 1002_f_at | 1557 | CYP2C19 | cytochrome P450 family 2 subfamily C member 19 |
1003_s_at | 1003_s_at | 643 | CXCR5 | C-X-C motif chemokine receptor 5 |
1004_at | 1004_at | 643 | CXCR5 | C-X-C motif chemokine receptor 5 |
1005_at | 1005_at | 1843 | DUSP1 | dual specificity phosphatase 1 |
The functions of the package also supports object of class: SummarizedExperiment
.
Note: In this case, the functions fData
,
pData
, exprs
should be replaced by their
corresponding functions rowData
, colData
and
assay
.
esetSpectralMap
The function esetSpectralMap
creates a spectral
map(P.J. 1976) for the dataset.
Some default parameters are set, e.g. to print the top 10 samples and
top 10 genes, to display the first two dimensions of the analysis…
The resulting biplot contains two components:
Here is an example for the ALL dataset, with the default parameters.
Several annotation variables are available in the
eSet
.
Four different aesthetics [aes]
can be used to display
these variables:
color
alpha
size
shape
For each of this aesthetic [aes]
, several parameters are
available:
[aes]Var
: name of the column of the
phenoData
of the eSet
used for the aesthetic,
i.e. colorVar
[aes]
: palette/specified shape/size used for the
aesthetic, i.e. color
Custom size and the transparency (variables sizeVar
and
alphaVar
) can be specified:
sizeRange
and
alphaRange
size
and
alpha
argumentsIn the example, the type and stage of the disease (variable
BT) is used for coloring, the remission type for the
transparency, the sex for the shape and the age for
the size of the points. A custom color palette is specified via the
color
argument.
print(esetSpectralMap(eset = ALL,
title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Sample annotation (1)",
colorVar = "BT", color = colorPalette,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(0.1, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topSamples = 0, topGenes = 0, cloudGenes = TRUE))
Just for the demonstration, another visualization of the same
dataset, using this time a continuous variable: age for
coloring and transparency, a factor for the size and BT for the
shape. Because the output is a ggplot
object, any
additional customization not implemented via specific parameters, the
plot can be modified with additional ggplot2
functions,
e.g. with the color of the gradient.
gg <- esetSpectralMap(eset = ALL,
title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Sample annotation (2)",
colorVar = "age",
shapeVar = "BT", shape = 15+1:nlevels(ALL$BT),
sizeVar = "remissionType", size = c(2, 4, 6),
alphaVar = "age", alphaRange = c(0.2, 0.8),
topSamples = 0, topGenes = 0, cloudGenes = TRUE)
# change color of gradient
library(ggplot2)
gg <- gg + scale_color_gradientn(colours =
colorRampPalette(c("darkgreen", "gold"))(2)
)
print(gg)
Several parameters related to the gene visualization are available:
psids
cloudGenes
cloudGenesNBins
cloudGenesColor
cloudGenesIncludeLegend
cloudGenesTitleLegend
The spectral map is done only on the subset of the genes, with the number of bins, color, and legend title modified.
print(esetSpectralMap(eset = ALL,
psids = 1:1000,
title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Custom cloud genes",
topSamples = 0, topGenes = 0,
cloudGenes = TRUE, cloudGenesColor = "red", cloudGenesNBins = 50,
cloudGenesIncludeLegend = TRUE, cloudGenesTitleLegend = "number of features"))
Three different kind of elements can be labelled in the plot: genes, samples and gene sets.
For each [element]
, several parameters are
available:
top[Elements]
: number of elements to labeltop[Elements]Var
(not available for gene sets): column
of the corresponding element in the eSet
used for
labelling, in phenoData
for sample and
featureData
for gene. If not specified, the feature/sample
names of the eSet
are usedtop[Elements]Just
: label justificationtop[Elements]Cex
: label sizetop[Elements]Color
: label colorThe distance (sum of squared coordinates) of the gene/sample/gene set to the origin of the plot is used to rank the elements, and to extract the top ‘outlying’ ones.
By default (and if installed), the package ggrepel
is used for text labelling (as in this vignette), to avoid overlapping
labels. The text labels can also be displayed with the ggplot2
by setting the parameter packageTextLabel
to
ggplot2
(default ggrepel
).
In the example, the top genes are labelled with gene symbols (SYMBOL column of the phenoData of the eSet), and the top samples with patient identifiers (cod column of the phenoData of the eSet).
print(esetSpectralMap(eset = ALL,
title = paste("Acute lymphoblastic leukemia dataset \n",
"Spectral map \n Label outlying samples and genes"),
colorVar = "BT", color = colorPalette,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topGenes = 10, topGenesVar = "SYMBOL",
topGenesCex = 2, topGenesColor = "darkgrey",
topSamples = 15, topSamplesVar = "cod", topSamplesColor = "chocolate4",
topSamplesCex = 3
))
Genes can be grouped into biologically meaningful gene sets, which can be labelled in the biplot.
Compared to previous section, two additional parameters are available:
geneSets
(required): list of gene
sets. Each element in the list should contain genes identifiers, and the
list should be namedgeneSetsVar
: column of the featureData of the eSet used
to map the gene identifiers contained in the geneSets
objectgeneSetsMaxNChar
: number of characters used in gene
sets labelsThe geneSets
can be created with the
getGeneSetsForPlot
function (wrapper around the
getGeneSets
function of the MLP
package), which can extract gene sets available in the Gene Ontology
(Biological Process, Molecular Function and Cellular Component) and KEGG
databases. Custom gene set lists can also be provided.
In the following example, only the pathways from the GOBP database are used.
geneSets <- getGeneSetsForPlot(
entrezIdentifiers = fData(ALL)$ENTREZID,
species = "Human",
geneSetSource = 'GOBP',
useDescription = TRUE)
Then this list of gene sets is provided to the
esetSpectralMap
function.
print(esetSpectralMap(eset = ALL,
title = "Acute lymphoblastic leukemia dataset \n Spectral map \n Gene set annotation",
colorVar = "BT", color = colorPalette,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topGenes = 0,
topGeneSets = 4, geneSets = geneSets, geneSetsVar = "ENTREZID", geneSetsMaxNChar = 30,
topGeneSetsCex = 2, topGeneSetsColor = "purple"))
Note: because of the inherent hierarchical structure of the Gene Ontology database, sets of genes can be very similar, which can result in close (even overlapping) labels in the visualization.
In all previous plots, only the first dimensions of the principal
component analysis were visualized, this can be specified via the
dim
parameter. The third and fourth dimensions are
visualized in the next Figure. This parameter is only available for the
spectral map visualization.
The function uses the mpm
and plot.mpm
function from the mpm
package. Some default parameters are
set for these two functions, they can be changed via the
mpm.args
and plot.mpm.args
arguments. For
further details, refer to the documentation of these two functions.
Note: One important argument is logtrans
in
mpm
function, which indicates if the data should be
first log-transformed before the computation. It is set by
default to FALSE, assuming that the data are already in the log
scale (it is the case for the ALL dataset).
All plots available in the esetVis
package can be
static or interactive.
The argument typePlot
can be set respectively to
static
(by default), in this case ggplot2
is
used, or interactive
, in this case either the
ggvis
or plotly
package is used.
Two functionalities of the interactive plot can be of interest for such high-dimensional data:
phenoData
) displayed in the hover can be specified via the
interactiveTooltipExtraVars
parameter.An interactive version of the spectral map of the previous section is created.
esetSpectralMap(
eset = ALL,
title = paste("Acute lymphoblastic leukemia dataset - spectral map"),
colorVar = "BT", color = unname(colorPalette),
shapeVar = "sex",
alphaVar = "remissionType",
typePlot = "interactive", packageInteractivity = "plotly",
figInteractiveSize = c(700, 600),
topGenes = 10, topGenesVar = "SYMBOL",
topSamples = 10,
# use all sample variables for hover
interactiveTooltipExtraVars = varLabels(ALL)
)
Note: as ggvis
plot requires to have a R session
running, only a static version of the plot is included.
library(ggvis)
# embed a static version of the plot
esetSpectralMap(
eset = ALL,
title = paste("Acute lymphoblastic leukemia dataset - spectral map"),
colorVar = "BT", color = unname(colorPalette),
shapeVar = "sex",
alphaVar = "remissionType",
typePlot = "interactive", packageInteractivity = "ggvis",
figInteractiveSize = c(700, 600),
sizeVar = "age", sizeRange = c(2, 6),
# use all phenoData variables for hover
interactiveTooltipExtraVars = varLabels(ALL)
)
esetTsne
Another unsupervised visualization is available in the package:
t-Distributed Stochastic Neighbor Embedding
(tsne). The function esetTnse
uses the
Rtsne
function of the same package.
Most of the previous parameters discussed for the spectral map are available for this visualization, at the exception of:
Rtsne
functiondim
Arguments to the Rtsne
function can be specified via the
Rtsne.args
argument.
Here is an example of tsne, for the same dataset and same annotation/labelling.
The tsne can be quite time-consuming, especially for big data. As the
Rtsne
function used in the background can also uses an
object of class dist
, the data can be pre-transformed
before running the tsne
analysis. The argument
fctTransformDataForInputTsne
enables to specify a custom
function to pre-transform the data.
print(esetTsne(eset = ALL,
title = "Acute lymphoblastic leukemia dataset \n Tsne",
colorVar = "BT", color = colorPalette,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topSamplesVar = "cod",
fctTransformDataForInputTsne =
function(mat) as.dist((1 - cor(mat))/2)
))
esetLda
Another visualization, this time semi-supervised (as a variable is
used for the computation), is included: Linear Discriminant
Analysis. This uses the lda
function from the
MASS
package.
This function maximizes the variance between levels of a factor, here
describing some sample annotation, specified via the ldaVar
argument.
As this analysis can be quite time consuming, for the demonstration, the analysis is run only a random feature subset of the data.
The returnAnalysis
parameter can be used, to extract the
analysis which will be used as input for the
esetPlotWrapper
function, used in the background. It is
available also for the esetSpectralMap
and
esetTnse
functions.
# extract random features, because analysis is quite time consuming
retainedFeatures <- sample(featureNames(ALL), size = floor(nrow(ALL)/5))
# run the analysis
outputEsetLda <- esetLda(eset = ALL[retainedFeatures, ], ldaVar = "BT",
title = paste("Acute lymphoblastic leukemia dataset \n",
"Linear discriminant analysis on BT variable \n",
"(Subset of the feature data)"),
colorVar = "BT", color = colorPalette,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topSamplesVar = "cod", topGenesVar = "SYMBOL",
returnAnalysis = TRUE
)
# extract and print the ggplot object
print(outputEsetLda$plot)
The top elements (here genes and samples) labelled in the plot can be
accessed via the topElements
slot of the object. These are
labelled with the identifiers used in the plot and named with
sample/gene identifiers of the eset
.
# extract top elements labelled in the plot
pander(t(data.frame(topGenes = outputEsetLda$topElements[["topGenes"]])))
38319_at | 33238_at | 38095_i_at | 39389_at | 40570_at | |
---|---|---|---|---|---|
topGenes | CD3D | LCK | HLA-DPB1 | CD9 | FOXO1 |
33316_at | 39709_at | 1498_at | 34741_at | 38051_at | |
---|---|---|---|---|---|
topGenes | TOX | SELENOW | ZAP70 | TFDP2 | MAL |
01005 | 16007 | 63001 | 26008 | 24005 | 12007 | 28008 | 04008 | |
---|---|---|---|---|---|---|---|---|
topSamples | 1005 | 16007 | 63001 | 26008 | 24005 | 12007 | 28008 | 4008 |
49006 | 04006 | |
---|---|---|
topSamples | 49006 | 4006 |
When returnAnalysis
is set to TRUE, the output of the
function can be used as input to the esetPlotWrapper
function, and extra parameters can then be modified.
Here the variable used for the shape of the points for the samples is changed to type of remission (remissionType column).
# to change some plot parameters
esetPlotWrapper(
dataPlotSamples = outputEsetLda$analysis$dataPlotSamples,
dataPlotGenes = outputEsetLda$analysis$dataPlotGenes,
esetUsed = outputEsetLda$analysis$esetUsed,
title = paste("Acute lymphoblastic leukemia dataset \n",
"Linear discriminant analysis on BT variable (2) \n",
"(Subset of the feature data)"),
colorVar = "BT", color = colorPalette,
shapeVar = "remissionType",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topSamplesVar = "cod", topGenesVar = "SYMBOL"
)
From the previous visualization (obtained on a subset of the genes), the biggest difference between all levels of the type/stage of the disease seems to reside between all B-cells (tagged B) and T-cells (tagged T). It might be interesting to focus on a subset of the data, e.g. only one cell type.
# keep only 'B-cell' samples
ALLBCell <- ALL[, grep("^B", ALL$BT)]
ALLBCell$BT <- factor(ALLBCell$BT)
colorPaletteBCell <- colorPalette[1:nlevels(ALLBCell$BT )]
print(esetLda(eset = ALLBCell[retainedFeatures, ], ldaVar = "BT",
title = paste("Acute lymphoblastic leukemia dataset \n",
"Linear discriminant analysis on BT variable \n B-cell only",
"(Subset of the feature data)"
),
colorVar = "BT", color = colorPaletteBCell,
shapeVar = "sex",
sizeVar = "age", sizeRange = c(2, 6),
alphaVar = "remissionType", alpha = c(0.3, 0.6, 0.9),
topSamplesVar = "cod", topGenesVar = "SYMBOL",
))
The subcell type which seems to differ the most within all B-cell type is the first one: B1 (most of these samples at the right side of the plot).