Data quality assessment is an integral part of preparatory data
analysis to ensure sound biological information retrieval. We present
here the MatrixQCvis
package, which provides shiny-based
interactive visualization of data quality metrics at the per-sample and
per-feature level. It is broadly applicable to quantitative omics data
types that come in matrix-like format (features x samples). It enables
the detection of low-quality samples, drifts, outliers and batch effects
in data sets. Visualizations include amongst others bar- and violin
plots of the (count/intensity) values, mean vs standard deviation plots,
MA plots, empirical cumulative distribution function (ECDF) plots,
visualizations of the distances between samples, and multiple types of
dimension reduction plots. MatrixQCvis
builds upon the
Bioconductor SummarizedExperiment
S4 class and enables thus
the facile integration into existing workflows.
MatrixQCvis
is especially addressed to analyze the
quality of proteomics and metabolomics data sets that are characterized
by missing values as it allows the user for imputation of missing values
and differential expression analysis using the proDA
package (Ahlman-Eltze and Anders 2019).
Besides this, MatrixQCvis
is extensible to other type of
data(e.g. transcriptomics count data) that can be represented as a
SummarizedExperiment
object. Furthermore, the
shiny
application facilitates simple differential
expression analysis using either moderated t-tests (from the
limma
package, Ritchie et al.
(2015)) or Wald tests (from the proDA
package, Ahlman-Eltze and Anders (2019)).
Within this vignette, the term feature will refer to a probed molecular entity, e.g. gene, transcript, protein, peptide, or metabolite.
In the following, we will describe the major setup of
MatrixQCvis
and the navigation through the shiny
application, shinyQC
.
MatrixQCvis
is currently under active development. If
you discover any bugs, typos or develop ideas of improving
MatrixQCvis
feel free to raise an issue via GitHub or send a mail
to the developer.
To install MatrixQCvis
enter the following to the
R
console
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("MatrixQCvis")
Before starting with the analysis, load the MatrixQCvis
package. This will also load the required packages Biobase
,
BiocGenerics
, GenomeInfoDb
,
GenomicRanges
, ggplot2
, IRanges
,
MatrixGenerics
, parallel
,
matrixStats
, plotly
, shiny
,
SummarizedExperiment
, and stats4
.
Please note: Depending on the supplied
SummarizedExperiment
object the user interface of
shinyQC
will differ:
SummarizedExperiment
object containing missing
values
Samples
, Measured Values
,
Missing Values
, Values
,
Dimension Reduction
and DE
will be
displayed,Values
and
Dimension Reduction
the imputed
data set
will be visualized,SummarizedExperiment
object containing no missing
values (i.e. with complete observations)
Samples
, Values
,
Dimension Reduction
and DE
will be
displayed,Values
and DE
the
imputed
data set will not be
visualized,In the following, the vignette will be (mainly) described from the
point of view of a SummarizedExperiment
containing no
missing values (RNA-seq dataset) and missing values (proteomics
dataset).
Here, we will retrieve a SummarizedExperiment
,
se
, object
from the ExperimentHub
package. The dataset (GEO accession
GSE62944) contains 741 normal samples across 24 cancer types from the
TCGA re-processed RNA-seq data. We will use the dataset obtained by
ExperimentHub
to showcase the functionality of
shinyQC
.
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
## Assuming valid proxy connection through ':1'
## If you experience connection issues consider using 'localHub=TRUE'
## the SummarizedExperiment object has the title "RNA-Sequencing and clinical
## data for 741 normal samples from The Cancer Genome Atlas"
eh[eh$title == "RNA-Sequencing and clinical data for 741 normal samples from The Cancer Genome Atlas"]
## ExperimentHub with 1 record
## # snapshotDate(): 2024-10-24
## # names(): EH1044
## # package(): GSE62944
## # $dataprovider: GEO
## # $species: Homo sapiens
## # $rdataclass: SummarizedExperiment
## # $rdatadateadded: 2017-10-29
## # $title: RNA-Sequencing and clinical data for 741 normal samples from The C...
## # $description: TCGA RNA-seq Rsubread-summarized raw count data for 741 norm...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tar.gz
## # $sourceurl: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
## # $sourcesize: NA
## # $tags: c("ExperimentData", "Genome", "DNASeqData", "RNASeqData")
## # retrieve record with 'object[["EH1044"]]'
## GSE62944 not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## here we will restrain the analysis on 40 samples and remove the features
## that have a standard deviation of 0
se <- se[, seq_len(40)]
se_sds <- apply(assay(se), 1, sd, na.rm = TRUE)
se <- se[!is.na(se_sds) & se_sds > 0, ]
The most important function to assess the data quality is the
shinyQC
function and its most important argument is
se
. shinyQC
expects a
SummarizedExperiment
object.
The shinyQC
function sets the following requirements to
the SummarizedExperiment
object se
: -
rownames(se)
are not allowed to be NULL
and
have to be set to the feature names, - colnames(se)
are not
allowed to be NULL
and have to be set to the sample names,
- colnames(se)
, colnames(assay(se))
and
rownames(colData(se))
all have to be identical.
If these requirements are not met, shinyQC
will stop and
throw an error.
Alternatively, a SummarizedExperiment
object can also be
loaded from within shinyQC
(when no se
object
is supplied to shinyQC
).
Objects belonging to the SummarizedExperiment
class are
containers for one or more assays, which are (numerical) matrices
containing the quantitative, measured information of the experiment. The
rows represent features of interest (e.g. transcripts, peptides,
proteins, or metabolites) and the columns represent the samples. The
SummarizedExperiment
object stores also information on the
features of interest (accessible by rowData
) and
information on the samples (accessible by colData
). The
name of samples and features will be accessed from
colnames(se)
and rownames(se)
,
respectively.
If there is more than one experimental data set (assay
)
stored in the SummarizedExperiment
object, a select option
will appear in the sidebar allowing the user to select the
assay
.
The actual shiny application can then be started by entering the
following to the R
console:
The assignment to qc
or any other object is not
mandatory. Upon exiting the shiny application, shinyQC
will
return a SummarizedExperiment
object containing the imputed
dataset that can be in the following further analyzed. The object will
only be returned if the function call was assigned to an object, e.g.
qc <- shinyQC(se)
.
Now, we will have a closer look on the user interface of the Shiny application.
The tab Samples
gives general information on the number
of samples in the se
object.
The first panel shows a barplot and displays the number of samples
per sample type, treatment, etc. As an example, if we want to display
how many samples are in se
for the different
type
s (type
is a column name in
colData(se)
and any column in colData(se)
can
be selected), this panel will show the following output:
The figure in this panel displays the relative proportions of the
numbers, e.g. how many samples (in %) are there for type
against arbitrary_values
. As the dataset was only shipped
with a type
and sample
column, for
demonstration, we manually added the column
arbitrary_levels
to the se
object. This column
is filled with the the values "A"
and "B"
.
Again, type
and arbitrary_values
are
columns in colData(se)
and any column in
colData(se)
can be selected to create the Mosaic plot.
The figure will tell us to what extent the se
contains
the different types in a balanced manner depending on
arbitrary_values
:
The tab Values
will take a closer look on the
assay
slot of the SummarizedExperiment
.
This panel shows the (count/intensity) values for raw
(raw
), batch corrected (normalized
),
normalized+batch corrected (batch corrected
),
normalized+batch corrected+transformed (transformed
), and
normalized+batch corrected+transformed+imputed (imputed
)
(count/intensity) values (imputation of missing values,
imputed
will only be shown if there are missing values in
the SummarizedExperiment
). As already mentioned, the
different methods for normalization, batch correction, transformation
(and imputation) are specified in the side panel.
For visualization purposes only, the (count/intensity) values for the
raw, normalized and batch corrected data sets can be log2
transformed (see the radio buttons in
Display log2 values? (only for 'raw', 'normalized' and 'batch corrected')
).
The type of visualization (boxplot or violin plot) can be specified
by selecting boxplot
or violin
in the radio
button panel (Type of display
).
The figure (violin plot) using the raw values will look like (log set
to TRUE
)
This panel shows a trend line for aggregated values to indicate
drifts/trends in data acquisition. It shows the sum- or
median-aggregated values (specified in Select aggregation
).
The plot displays trends in data acquisition that originate e.g. from
differences in instrument sensitivity. The panel displays aggregated
values for raw (raw
), batch corrected
(batch corrected
), batch corrected+normalized
(normalized
), batch corrected+normalized+transformed
(transformed
), and batch
corrected+normalized+transformed+imputed (imputed
)
(count/intensity) values (imputation of missing values,
imputed
will only be shown if there are missing values in
the SummarizedExperiment
). The different methods for
normalization, batch correction, transformation (and imputation) are
specified in the sidebar panel.
The smoothing is calculated from the selection of samples that are
specified by the drop-down menus Select variable
and
Select level to highlight
. The menu
Select variable
corresponds to the colnames
in
colData(se)
. Here, we can select for the higher-order
variable, e.g. the type (containing for example BLCA
,
BRCA
, etc.). The drop-down menu
Select level to highlight
will specify the actual selection
from which the trend line will be calculated (e.g. BLCA
,
BRCA
, etc.). Also, the menu will always include the level
all
, which will use all points to calculate the trend line.
If we want to calculate the trend line of aggregated values of all
samples belonging to the type QC
, we select QC
in the drop-down menu.
The panel allows the users for further customization after expanding
the collapsed box. The data input is selected in the drop-down menu
under Select data input
. The smoothing method (either LOESS
or linear model) is selected in the drop-down menu under
Select smoothing method
. The aggregation method is selected
in the drop-down menu Select smoothing method
.
With the drop-down menu
Select categorical variable to order samples
, the samples
(x-axis) will be ordered alphanumerically according to the selected
level (and the sample name).
Here, we are interested in observing if there is a trend/drift for
samples of type
BRCA
. We select
LOESS
as the method for the trend line and
median
as the aggregation method. The figure will then look
as follows:
This panel shows the coefficient of variation values for raw
(raw
), normalized (normalized
),
normalized+batch corrected (batch corrected
),
normalized+batch corrected+transformed (transformed
), and
normalized+batch corrected+transformed+imputed (imputed
)
(count/intensity) values (imputation of missing values,
imputed
will only be shown if there are missing values in
the SummarizedExperiment
) among the samples. The different
methods for normalization, batch corrected, transformation, (and
imputation) are specified in the sidebar panel.
The panel displays the coefficient of variation values from the
samples of the SummarizedExperiment
object. The
coefficients of variation are calculated according to the formula
sd(x) / mean(x) * 100
with x
the sample values
and sd
the standard deviation. The plot might be useful
when looking at the coefficient of variation values of a specific sample
type (e.g. QCs) and trying to identify outliers.
Here, we shows the plot of coefficient of variation values from the
raw values (as obtained by assay(se)
), normalized values
(using sum
normalization), transformed values (using
vsn
), batch corrected values (using none
) and
imputed values (using the MinDet
algorithm, Lazar et al. (2016)).
The panel shows the three mean-sd (standard deviation) plots for
normalized+batch corrected+transformed (transformed
) and
normalized+batch corrected+transformed+imputed (imputed
)
values (imputed
will only be shown if there are missing
values in the SummarizedExperiment
). The sd and mean are
calculated feature-wise from the values of the respective data set. The
plot allows the user to visualize if there is a dependence of the sd on
the mean. The red line depicts the running median estimator
(window-width 10%). In case of sd-mean independence, the running median
should be approximately horizontal.
For the transformed values, the mean-sd plot will look like
The panel displays MA plots and Hoeffding’s D statistic.
In the first part of the panel the A
vs. M
plots per sample are depicted.
The values are defined as follows
where Ii and Ij are per
definition log2
-transformed values. In the case of
raw
, normalized
, or
batch corrected
the values are
log2
-transformed prior to calculating A
and
M
. In case of transformed
or
imputed
the values are taken as they are (N.B. when the
transformation method is set to none
the values are not
log2
-transformed).
The values for Ii are taken
from the sample i. For Ij, the
feature-wise means are calculated from the values of the
group
j of
samples specified by the drop-down menu group
. The sample
for calculating Ii is excluded
from the group j. The group
can be set to "all"
(i.e. all samples except sample i are used to calculate Ij) or any other
column in colData(se)
. For any group except
"all"
the group is taken to which the sample i belongs to and the sample i is excluded from the feature-wise
calculation.
The MA values for all samples are by default displayed facet-wise.
The MA plot can be set to a specific sample by changing the selected
value in the drop-down menu plot
.
The underlying data set can be selected by the drop-down menu
(Data set for the MA plot
).
In the second part of the panel, the Hoeffding’s D statistic values
are visualized for the different data sets raw
,
normalized
, batch corrected
,
transformed
, and imputed
(imputed
will only be shown if there are missing values in the
SummarizedExperiment
).
D
is a measure of the distance between
F(A, M)
and G(A)H(M)
, where
F(A, M)
is the joint cumulative distribution function (CDF)
of A
and M
, and G
and
H
are marginal CDFs. The higher the value of
D
, the more dependent are A
and
M
. The D
values are connected for the same
samples along the different data sets (when lines
is
selected), enabling the user to track the influence of normalization,
batch correction, transformation (and imputation) methods on the
D
values.
The MA plot using the raw values and group = "all"
will
look like the following (plot
= “Sample_1-1”`):
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_hex()`).
The plot shows the ECDF of the values of the sample i and the feature-wise mean values
of a group j of samples
specified by the drop-down menu Group
. The sample for
calculating Ii is excluded
from the group j. The group
can be set to "all"
(i.e. all samples except sample i are used to calculate Ij) or any other
column in colData(se)
. For any group except
"all"
the group is taken to which the sample i belongs to and the sample i is excluded from the feature-wise
calculation.
The underlying data set can be selected by the drop-down menu
Data set for the MA plot
. The sample i can be selected by the drop-down
menu Sample
. The group can be selected by the drop-down
menu Group
.
The ECDF plot for the sample
"TCGA-K4-A3WV-11A-21R-A22U-07"
, group = "all"
,
and the raw values (as obtained by assay(se)
) will look
like:
## Warning in ks.test.default(x = value_s, y = value_g, exact = NULL, alternative
## = "two.sided"): p-value will be approximate in the presence of ties
On the left, the panel depicts heatmaps of distances between samples
for the data sets of raw (raw
), normalized
(normalized
), normalized+batch corrected
(batch corrected
), normalized+batch corrected+transformed
(transformed
), and normalized+batch
corrected+transformed+imputed (imputed
,
imputed
will only be shown if there are missing values in
the SummarizedExperiment
). The annotation of the heatmaps
can be adjusted by the drop-down menu annotation
.
On the right panel the sum of distances to other samples is depicted.
The distance matrix and sum of distances (for raw values, as obtained
by assay(se)
) will look like:
The first plot shows the values for each samples along the data processing steps. The feature to be displayed is selected via the drop-down menu Select feature.
The second plot shows the coefficient of variation of the values for
all features in the data set along the data processing steps. The
features of same identity can be connected by lines by clicking on
lines
.
The element in the bottom of the tab panel
(Select features
) will specify the selection of features in
the data set (the selection will be propagated through all tabs):
all
: when selected, all features in the uploaded
SummarizedExperiment
object will used,exclude
: when selected, the features specified in the
text input field will be excluded from the uploaded
SummarizedExperiment
,select
: when selected, the features specified in the
text input field will be selected from the uploaded
SummarizedExperiment
(note: a minimum of three features
needs to be selected in order for the selection to take place).Within this tab several dimension reduction plots are displayed to
visualize the level of similarity of samples of a data set: principal
component analysis (PCA), principal coordinates analysis (PCoA, also
known as multidimensional scaling, MDS), non-metric multidimensional
scaling (NMDS), t-distributed stochastic neighbor embedding (tSNE, Maaten and Hinton (2008)), and uniform manifold
approximation and projection (UMAP, McInnes,
Healy, and Melville (2018)). Data input for the dimension
reduction plots is the imputed
data set.
The panel depicts a plot of PCA. The data input can be
scale
d and center
ed prior to calculating the
principal components by adjusting the respective tick marks. The
different PCs can be displayed by changing the values in the drop-down
menus for the x-axis and y-axis.
A scree plot, showing the explained variance per PC, is displayed in the right panel.
The panel depicts a plot of PCoA (= multidimensional scaling). Different distance measures can be used to calculate the distances between the samples (euclidean, maximum, manhattan, canberra, minkowski). The different axes of the transformed data can be displayed by changing the values in the drop-down menus for the x-axis and y-axis.
The panel depicts a plot of NMDS. Different distance measures can be used to calculate the distances between the samples (euclidean, maximum, manhattan, canberra, minkowski). The different axes of the transformed data can be displayed by changing the values in the drop-down menus for the x-axis and y-axis.
The panel depicts a plot of tSNE. The parameters
Perplexity
, Number of iterations
,
Number of retained dimensions in initial PCA
, and
Output dimensionality
required for the tSNE algorithm can
be set with the sliders. For the parameter
Number of retained dimensions in initial PCA
the panel
Principal components
can be employed that shows on the left
panel a Scree plot of the data set and permuted values and corresponding
p-values from the permutation (set the number of principal components
where the explained variance is above the permuted data set/where
p-values are below 0.05).
The different dimensions of the transformed data can be displayed by
changing the values in the drop-down menus for the x-axis and y-axis
(either two or three dimensions according to the
Output dimensionality
).
The panel depicts a plot of UMAP. The parameters
Minimum distance
, Number of neighbors
, and
Spread
, required for the UMAP algorithm can be set with the
sliders.
The different dimensions of the transformed data can be displayed by changing the values in the drop-down menus for the x-axis and y-axis.
This tab enables the user to test for differential expression between conditions.
Currently, two methods/tests are implemented for calculating
differential expression between conditions: moderated t-tests from
limma
and the Wald test from proDA
. The
approach of proDA
does not require imputed values and will
take the normalized+transformed+batch corrected
(batch corrected
) data set as input.
The moderated t-statistic is the ratio of the M-value to its standard
error, where the M-value is the log2-fold change for the feature of
interest. The moderated t-statistic has the same interpretation as an
ordinary t-statistic except that the standard errors have been moderated
across features, borrowing information from the ensemble of features to
aid with inference about each individual feature (Kammers et al. 2015). We use the
eBayes
function from limma
to compute the
moderated t-statistics (trend
and robust
are
set by default to TRUE
, see ?eBayes
for
further information).
proDA
(Ahlman-Eltze and Anders
2019) was developed for the differential abundance analysis in
label-free proteomics data sets. proDA
models missing
values in an intensity-dependent probabilistic manner based on a dropout
curve (for further details see Ahlman-Eltze and
Anders (2019)).
In the input field for levels, the formula for the levels is entered.
The formula has to start with a ~
(tilde) and the
R
-specific symbolic form:
+
to add terms,:
to denote an interaction term,*
to denote factor crossing (a*b
is
interpreted as a+b+a:b
), a
and b
are columns in colData(se)
,-
to remove the specified term,
e.g. ~ a - 1
to specify no intercept, and a
a
column in colData(se)
,+ 0
to alternatively specify a model without
intercept.The colnames
of colData
can be added as
terms.
The colnames
of the model matrix can be used to
calculate contrasts, e.g. a - b
to specify the contrast
between a
and b
. The contrasts can be
specified in the input field in the sidebar panel.
The panels Model matrix
and Contrast matrix
will show the model and contrast matrix upon correct specification of
the levels and contrasts. The panel Top DE
will show the
differential features in a tabular format, while the panel
Volcano plot
will display the information of log fold
change (for limma
) or difference (for proDA
)
against the p-values (displayed as -log10(p-value)
.
In the following, we will look at the different panels with the
se
input as specified above.
The panel Sample meta-data
will show the column data of
the se
object. The output will help to specify the levels
for the model matrix.
The output will look like the following (only the first few rows are shown here)
## DataFrame with 10 rows and 2 columns
## sample type
## <character> <factor>
## TCGA-K4-A3WV-11A-21R-A22U-07 TCGA-K4-A3WV-11A-21R.. BLCA
## TCGA-49-6742-11A-01R-1858-07 TCGA-49-6742-11A-01R.. LUAD
## TCGA-DD-A3A2-11A-11R-A213-07 TCGA-DD-A3A2-11A-11R.. LIHC
## TCGA-BH-A0DV-11A-22R-A12P-07 TCGA-BH-A0DV-11A-22R.. BRCA
## TCGA-77-8008-11A-01R-2187-07 TCGA-77-8008-11A-01R.. LUSC
## TCGA-BH-A0BC-11A-22R-A089-07 TCGA-BH-A0BC-11A-22R.. BRCA
## TCGA-91-6836-11A-01R-1858-07 TCGA-91-6836-11A-01R.. LUAD
## TCGA-BK-A4ZD-11A-12R-A27V-07 TCGA-BK-A4ZD-11A-12R.. UCEC
## TCGA-H6-A45N-11A-12R-A26U-07 TCGA-H6-A45N-11A-12R.. PAAD
## TCGA-FL-A1YG-11A-12R-A16F-07 TCGA-FL-A1YG-11A-12R.. UCEC
When entering ~ type + 0
into the input field
Select levels for Model Matrix
, the
Model matrix
panel will look like (only the first few rows
are shown here):
## typeBLCA typeBRCA typeCESC typeCHOL typeCOAD
## TCGA-K4-A3WV-11A-21R-A22U-07 1 0 0 0 0
## TCGA-49-6742-11A-01R-1858-07 0 0 0 0 0
## TCGA-DD-A3A2-11A-11R-A213-07 0 0 0 0 0
## TCGA-BH-A0DV-11A-22R-A12P-07 0 1 0 0 0
## TCGA-77-8008-11A-01R-2187-07 0 0 0 0 0
## TCGA-BH-A0BC-11A-22R-A089-07 0 1 0 0 0
## typeESCA typeGBM typeHNSC typeKICH typeKIRC
## TCGA-K4-A3WV-11A-21R-A22U-07 0 0 0 0 0
## TCGA-49-6742-11A-01R-1858-07 0 0 0 0 0
## TCGA-DD-A3A2-11A-11R-A213-07 0 0 0 0 0
## TCGA-BH-A0DV-11A-22R-A12P-07 0 0 0 0 0
## TCGA-77-8008-11A-01R-2187-07 0 0 0 0 0
## TCGA-BH-A0BC-11A-22R-A089-07 0 0 0 0 0
## typeKIRP typeLIHC typeLUAD typeLUSC typePAAD
## TCGA-K4-A3WV-11A-21R-A22U-07 0 0 0 0 0
## TCGA-49-6742-11A-01R-1858-07 0 0 1 0 0
## TCGA-DD-A3A2-11A-11R-A213-07 0 1 0 0 0
## TCGA-BH-A0DV-11A-22R-A12P-07 0 0 0 0 0
## TCGA-77-8008-11A-01R-2187-07 0 0 0 1 0
## TCGA-BH-A0BC-11A-22R-A089-07 0 0 0 0 0
## typePCPG typePRAD typeREAD typeSARC typeSKCM
## TCGA-K4-A3WV-11A-21R-A22U-07 0 0 0 0 0
## TCGA-49-6742-11A-01R-1858-07 0 0 0 0 0
## TCGA-DD-A3A2-11A-11R-A213-07 0 0 0 0 0
## TCGA-BH-A0DV-11A-22R-A12P-07 0 0 0 0 0
## TCGA-77-8008-11A-01R-2187-07 0 0 0 0 0
## TCGA-BH-A0BC-11A-22R-A089-07 0 0 0 0 0
## typeSTAD typeTHCA typeTHYM typeUCEC
## TCGA-K4-A3WV-11A-21R-A22U-07 0 0 0 0
## TCGA-49-6742-11A-01R-1858-07 0 0 0 0
## TCGA-DD-A3A2-11A-11R-A213-07 0 0 0 0
## TCGA-BH-A0DV-11A-22R-A12P-07 0 0 0 0
## TCGA-77-8008-11A-01R-2187-07 0 0 0 0
## TCGA-BH-A0BC-11A-22R-A089-07 0 0 0 0
As an example, we are interested in the differential expression
between the samples of typ BRCA
and LUAD
. We
enter typeBRCA-typeLUAD
in the input field
Select contast(s)
. The contrast matrix (only the first few
rows are shown here) will then look like:
## Contrasts
## Levels typeBRCA-typeLUAD
## typeBLCA 0
## typeBRCA 1
## typeCESC 0
## typeCHOL 0
## typeCOAD 0
## typeESCA 0
Switching to the panel Top DE
, we will obtain a table
with the differentially expressed features (normalization method is set
to none
, transformation method to none
and
imputation method to MinDet
). Here, differential expression
is tested via moderated t-tests (from the limma
package):
The output will be (only the first 20 rows will be shown here):
## Coefficients not estimable: typeCHOL typeESCA typeGBM typeKIRP typePCPG typeREAD typeSARC typeSKCM typeTHYM
The last panel of the tab DE
displays the information
from the differential expression analysis. In the case of moderated
t-tests, the plot shows the log fold changes between the specified
contrasts against the p-values. Using the afore-mentioned specification
for the model matrix and the contrast matrix, the plot will look
like:
We will retrieve a proteomics dataset from the
ExperimentHub
package. The dataset contains quantitative
profiling information of about 12000 proteins in 375 cell lines spanning
24 primary diseases and 27 lineages. The dataset was obtained by the
Gygilab at Harvard University. Within the ExperimentHub
,
the dataset is stored as a tibble
in long format. We will
convert it here to a SummarizedExperiment
class object.
## ExperimentHub with 1 record
## # snapshotDate(): 2024-10-24
## # names(): EH3459
## # package(): depmap
## # $dataprovider: Broad Institute
## # $species: Homo sapiens
## # $rdataclass: tibble
## # $rdatadateadded: 2020-05-19
## # $title: proteomic_20Q2
## # $description: Quantitative profiling of 12399 proteins in 375 cell lines, ...
## # $taxonomyid: 9606
## # $genome:
## # $sourcetype: CSV
## # $sourceurl: https://gygi.med.harvard.edu/sites/gygi.med.harvard.edu/files/...
## # $sourcesize: NA
## # $tags: c("ExperimentHub", "ExperimentData", "ReproducibleResearch",
## # "RepositoryData", "AssayDomainData", "CopyNumberVariationData",
## # "DiseaseModel", "CancerData", "BreastCancerData", "ColonCancerData",
## # "KidneyCancerData", "LeukemiaCancerData", "LungCancerData",
## # "OvarianCancerData", "ProstateCancerData", "OrganismData",
## # "Homo_sapiens_Data", "PackageTypeData", "SpecimenSource",
## # "CellCulture", "Genome", "Proteome", "StemCell", "Tissue")
## # retrieve record with 'object[["EH3459"]]'
## depmap not installed.
## Full functionality, documentation, and loading of data might not be possible without installing
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## reduce the number of files to speed computations up (can be skipped)
cell_line_unique <- dplyr::pull(tbl, "cell_line") |>
unique()
tbl <- tbl |>
filter(tbl[["cell_line"]] %in% cell_line_unique[seq_len(40)])
## using the information from protein_id/cell_line calculate the mean expression
## from the protein_expression values,
tbl_mean <- tbl |>
dplyr::group_by(protein_id, cell_line) |>
dplyr::summarise(mean_expression = mean(protein_expression, na.rm = TRUE),
.groups = "drop")
## create wide format from averaged expression data
tbl_mean <- tbl_mean |>
tidyr::pivot_wider(names_from = cell_line, values_from = mean_expression,
id_cols = protein_id, names_expand = TRUE)
## add additional information from tbl to tbl_mean, remove first columns
## protein_expression and cell_line, remove the duplicated rows based on
## protein_id, finally add the tbl_mean to tbl_info
tbl_info <- tbl |>
dplyr::select(-protein_expression, -cell_line) |>
dplyr::distinct(protein_id, .keep_all = TRUE)
tbl_mean <- dplyr::right_join(tbl_info, tbl_mean,
by = c("protein_id" = "protein_id"))
## create assay object
cols_a_start <- which(colnames(tbl_mean) == "A2058_SKIN")
a <- tbl_mean[, cols_a_start:ncol(tbl_mean)] |>
as.matrix()
rownames(a) <- dplyr::pull(tbl_mean, "protein_id")
## create rowData object
rD <- tbl_mean[, seq_len(cols_a_start - 1)] |>
as.data.frame()
rownames(rD) <- dplyr::pull(tbl_mean, "protein_id")
## create colData object, for the tissue column split the strings in column
## sample by "_", remove the first element (cell_line) and paste the remaining
## elements
cD <- data.frame(
sample = colnames(tbl_mean)[cols_a_start:ncol(tbl_mean)]
) |>
mutate(tissue = unlist(lapply(strsplit(sample, split = "_"), function(sample_i)
paste0(sample_i[-1], collapse = "_"))))
rownames(cD) <- cD$sample
## create the SummarizedExperiment
se <- SummarizedExperiment(assay = a, rowData = rD, colData = cD)
## here we remove the features that have a standard deviation of 0
se_sds <- apply(assay(se), 1, sd, na.rm = TRUE)
se <- se[!is.na(se_sds) & se_sds > 0, ]
Again, the shiny application can be started via
A description of the tabs Samples
, Values
,
Dimension Reduction
, and DE
can be found above
under the section QC analysis of TCGA RNA-seq data
. We will
focus here on the two tabs that are shown only for data sets that
contain missing values, Measured Values
and
Missing Values
.
The tabs Measured Values
and Missing Values
are only displayed if the SummarizedExperiment
object
contains missing values as in the case of this proteomics dataset.
The layout in the tabs Measured Values
and
Missing Values
is similar. Therefore, the two tabs are
described here simultaneously and the differences are pointed out where
necessary.
The plot shows the number of measured or missing values per sample in
the data set (depending on the selected tab). For example for the tab
Measured Values
, the plot will look like
In this case, the samples show approximately the same number of measured features, i.e. there is no indication to remove any sample based on this measure.
The plot shows different data depending on which tab
(Measured Values
or Missing Values
) is
selected. The binwidth will be determined by the slider input
(Binwidth (Measured Values)
or
Binwidth (Missing Values)
).
The plot shows how often a feature was measured in a certain number of samples.
Examples:
in the case of one sample (x-axis), the y-axis will denote the number of features for which only one feature was measured.
in the case of the number of total samples (x-axis), the y-axis will denote the number of features with complete observations (i.e. the number of features for which the feature was quantified in all samples).
The plot shows how often a feature was missing in a certain number of samples.
Examples:
in the case of one sample (x-axis), the y-axis will denote the number of features for which only one feature was missing.
in the case of the number of total samples (x-axis), the y-axis will denote the number of features with completely missing observations (i.e. the number of features for which no feature was quantified in all samples).
The plot in this panel can be read accordingly to the one in the
panel Histogram Features
, but, it is segregated for the
specified variable (e.g. it shows the distribution of measured/missing
values among the sample types).
The plot shows the interaction of sets between different variables depending on their presence or absence. The dots in the UpSet plot specify if the criteria for presence/absence are fulfilled (see below for the definition). In each column the intersections are displayed together with the number of features regarding the type of intersection. The boxplots in the rows display the number of present/absent features per set.
All sets present in the data set are displayed. Depending on the data
set, however, not all intersections sets are displayed (see the help
page for the upset
function from the UpSetR
package).
Presence is defined by a feature being measured in at least one sample of a set.
Absence is defined by a feature with only missing values (i.e. no measured values) of a set.
This panel will retrieve the features specified by intersection of sets.
This panel builds upon the Upset
panel. By selecting the
check boxes, the names of the features (taken from the
rownames
of the features) are printed as text that fulfill
the defined intersection of sets.
Example: Four sets (Type_1
, Type_2
,
Type_3
, and Type_4
) are found for a specified
variable (here: type
). When selecting the boxes for
Type_1
and Type_2
(while not selecting the
boxes for Type_3
and Type_4
) the features that
are present/absent for Type_1
and Type_2
(but
not in the sets Type_3
and Type_4
) are
returned.
For the tab Measured Values
, presence is defined by a
feature being measured in at least one sample of a set.
For the tab Missing Values
, absence is defined by a
feature with only missing values (i.e. no measured values) of a set.
All software and respective versions to build this vignette are listed here:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] limma_3.63.3 ExperimentHub_2.15.0
## [3] AnnotationHub_3.15.0 BiocFileCache_2.15.0
## [5] dbplyr_2.5.0 MatrixQCvis_1.15.0
## [7] shiny_1.10.0 plotly_4.10.4
## [9] ggplot2_3.5.1 SummarizedExperiment_1.37.0
## [11] Biobase_2.67.0 GenomicRanges_1.59.1
## [13] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [15] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [17] generics_0.1.3 MatrixGenerics_1.19.1
## [19] matrixStats_1.5.0 DT_0.33
## [21] knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.2 later_1.4.1 norm_1.0-11.1
## [4] filelock_1.0.3 tibble_3.2.1 preprocessCore_1.69.0
## [7] XML_3.99-0.18 rpart_4.1.24 lifecycle_1.0.4
## [10] edgeR_4.5.1 doParallel_1.0.17 lattice_0.22-6
## [13] MASS_7.3-64 crosstalk_1.2.1 backports_1.5.0
## [16] magrittr_2.0.3 Hmisc_5.2-2 sass_0.4.9
## [19] rmarkdown_2.29 jquerylib_0.1.4 yaml_2.3.10
## [22] httpuv_1.6.15 askpass_1.2.1 reticulate_1.40.0
## [25] DBI_1.2.3 buildtools_1.0.0 RColorBrewer_1.1-3
## [28] abind_1.4-8 Rtsne_0.17 purrr_1.0.2
## [31] nnet_7.3-20 rappdirs_0.3.3 sandwich_3.1-1
## [34] sva_3.55.0 proDA_1.21.0 circlize_0.4.16
## [37] GenomeInfoDbData_1.2.13 maketools_1.3.1 genefilter_1.89.0
## [40] umap_0.2.10.0 RSpectra_0.16-2 annotate_1.85.0
## [43] codetools_0.2-20 DelayedArray_0.33.3 tidyselect_1.2.1
## [46] gmm_1.8 shape_1.4.6.1 farver_2.1.2
## [49] UCSC.utils_1.3.0 base64enc_0.1-3 jsonlite_1.8.9
## [52] GetoptLong_1.0.5 Formula_1.2-5 survival_3.8-3
## [55] iterators_1.0.14 foreach_1.5.2 tools_4.4.2
## [58] Rcpp_1.0.14 glue_1.8.0 gridExtra_2.3
## [61] SparseArray_1.7.2 xfun_0.50 mgcv_1.9-1
## [64] dplyr_1.1.4 shinydashboard_0.7.2 withr_3.0.2
## [67] BiocManager_1.30.25 fastmap_1.2.0 shinyjs_2.1.0
## [70] openssl_2.3.1 digest_0.6.37 R6_2.5.1
## [73] mime_0.12 imputeLCMD_2.1 colorspace_2.1-1
## [76] jpeg_0.1-10 RSQLite_2.3.9 UpSetR_1.4.0
## [79] hexbin_1.28.5 tidyr_1.3.1 data.table_1.16.4
## [82] httr_1.4.7 htmlwidgets_1.6.4 S4Arrays_1.7.1
## [85] pkgconfig_2.0.3 gtable_0.3.6 blob_1.2.4
## [88] ComplexHeatmap_2.23.0 impute_1.81.0 XVector_0.47.2
## [91] sys_3.4.3 htmltools_0.5.8.1 shinyhelper_0.3.2
## [94] clue_0.3-66 scales_1.3.0 tmvtnorm_1.6
## [97] png_0.1-8 rstudioapi_0.17.1 rjson_0.2.23
## [100] checkmate_2.3.2 nlme_3.1-166 curl_6.1.0
## [103] cachem_1.1.0 zoo_1.8-12 GlobalOptions_0.1.2
## [106] stringr_1.5.1 BiocVersion_3.21.1 parallel_4.4.2
## [109] foreign_0.8-88 AnnotationDbi_1.69.0 vsn_3.75.0
## [112] pillar_1.10.1 grid_4.4.2 vctrs_0.6.5
## [115] pcaMethods_1.99.0 promises_1.3.2 xtable_1.8-4
## [118] cluster_2.1.8 htmlTable_2.4.3 evaluate_1.0.3
## [121] mvtnorm_1.3-3 cli_3.6.3 locfit_1.5-9.10
## [124] compiler_4.4.2 rlang_1.1.4 crayon_1.5.3
## [127] labeling_0.4.3 affy_1.85.1 plyr_1.8.9
## [130] stringi_1.8.4 viridisLite_0.4.2 BiocParallel_1.41.0
## [133] munsell_0.5.1 Biostrings_2.75.3 lazyeval_0.2.2
## [136] Matrix_1.7-1 bit64_4.5.2 KEGGREST_1.47.0
## [139] statmod_1.5.0 memoise_2.0.1 affyio_1.77.2
## [142] bslib_0.8.0 bit_4.5.0.1