This vignette demonstrates the use plotCellTypeMDS()
,
plotCellTypePCA()
, boxplotPCA()
and
calculateDiscriminantSpace()
, which are designed to help
assess and compare cell types between query and reference datasets using
Multidimensional Scaling (MDS) and Principal Component Analysis (PCA),
and Fisher Discriminant Analysis (FDA) respectively.
This vignette also demonstrates how to visualize gene expression data
from single-cell RNA sequencing (scRNA-seq) experiments using two
functions: plotGeneExpressionDimred()
and
plotMarkerExpression()
. These functions allow researchers
to explore gene expression patterns across various dimensions and
compare expression distributions between different datasets or cell
types.
Lastly this vignette will demonstrate how to visualize quality control (QC) scores and annotation scores, potentially identifying relationship between QC statistics (e.g., total library size or percentage of mitochondrial genes) and cell type annotation scores.
In the context of the scDiagnostics
package, this
vignette illustrates how to evaluate cell type annotations using two
distinct datasets:
reference_data
: This dataset consists of
meticulously curated cell type annotations assigned by experts,
providing a robust benchmark for assessing the quality of annotations.
It serves as a standard reference to evaluate and detect anomalies or
inconsistencies within the cell type annotations.
query_data
: This dataset includes annotations both
from expert assessments and those generated by the SingleR
package. By comparing these annotations, you can identify discrepancies
between manual and automated results, thereby pinpointing potential
inconsistencies or areas that may need further scrutiny.
qc_data
: This dataset includes data from
haematopoietic stem and progenitor cells, including QC metrics like
library size and mitochondrial content, SingleR-predicted
cell types and annotation scores indicating prediction
confidence.
The plotCellTypeMDS()
function creates a scatter plot
using Multidimensional Scaling (MDS) to visualize the similarity between
cell types in query and reference datasets. This function generates a
MDS plot that colors cells according to their types based on a
dissimilarity matrix calculated from a specified gene set.
The plotCellTypePCA()
function projects the query
dataset onto the PCA space of the reference dataset. It then plots
specified principal components for different cell types, allowing
comparison of PCA results between query and reference datasets.
# Generate the MDS scatter plot with cell type coloring
plotCellTypePCA(query_data = query_data,
reference_data = reference_data,
cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
pc_subset = 1:3)
Here, plotCellTypePCA()
projects the query data into the
PCA space of the reference data and plots the specified principal
components. The pc_subset
argument allows you to select
which principal components to display.
The boxplotPCA()
function provides a boxplot
visualization of principal components for different cell types across
query and reference datasets. This function helps in comparing the
distributions of PCA scores for specified cell types and datasets.
# Plot the PCA data as boxplots
boxplotPCA(query_data = query_data,
reference_data = reference_data,
cell_types = c("CD4", "CD8", "B_and_plasma", "Myeloid"),
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
pc_subset = 1:3)
In this example, boxplotPCA()
generates boxplots for the
specified principal components, showing the distribution of PCA scores
for different cell types and datasets.
The calculateDiscriminantSpace()
function projects query
single-cell RNA-seq data onto a discriminant space defined by reference
data. This process involves using the reference data to identify key
variables, compute discriminant vectors, and project both the reference
and query data onto this space. The function then assesses the
similarity between the projections of the query data and the reference
data using cosine similarity and Mahalanobis distance.
The function performs the following steps for each pairwise combination of cell types:
First, applying the function to the SingleR
annotation.
# Compute important variables for all pairwise cell comparisons
disc_output <- calculateDiscriminantSpace(
reference_data = reference_data,
query_data = query_data,
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation")
# Generate scatter and boxplot
plot(disc_output, plot_type = "scatterplot")
We aim to evaluate whether a new batch of query cells fits well
within established reference cell types or if there are cells exhibiting
unusual characteristics. We will use the Mahalanobis distance calculated
by the calculateDiscriminantSpace()
function to achieve
this.
# Perform discriminant analysis and projection
discriminant_results <- calculateDiscriminantSpace(
reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
calculate_metrics = TRUE, # Compute Mahalanobis distance and cosine similarity
alpha = 0.01 # Significance level for Mahalanobis distance cutoff
)
You can extract and analyze the Mahalanobis distances of query cells relative to reference cell types. High Mahalanobis distances may indicate potential anomalies or novel cell states.
# Extract Mahalanobis distances
mahalanobis_distances <- discriminant_results$`CD4-CD8`$query_mahalanobis_dist
# Identify anomalies based on a threshold
threshold <- discriminant_results$`CD4-CD8`$mahalanobis_crit # Use the computed cutoff value
anomalies <- mahalanobis_distances[mahalanobis_distances > threshold]
You can now create visualizations to illustrate the distribution of Mahalanobis distances and highlight potential anomalies.
# Load ggplot2 for visualization
library(ggplot2)
# Convert distances to a data frame
distances_df <- data.frame(Distance = mahalanobis_distances,
Anomaly = ifelse(mahalanobis_distances > threshold, "Anomaly", "Normal"))
# Plot histogram of Mahalanobis distances
ggplot(distances_df, aes(x = Distance, fill = Anomaly)) +
geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.7) +
labs(title = "Histogram of Mahalanobis Distances",
x = "Mahalanobis Distance",
y = "Frequency") +
theme_bw()
Low Mahalanobis Distances: These query cells align well with the reference cell types, suggesting they are similar to the established cell types.
High Mahalanobis Distances: Cells with high distances may represent outliers or novel, previously uncharacterized cell types. Further investigation may be needed to understand these anomalies.
The calculateSIRSpace()
function leverages Sliced
Inverse Regression (SIR) to analyze high-dimensional cell type data by
focusing on the directions that best separate different cell types and
their subtypes. The function projects data onto a lower-dimensional SIR
space, aiming to highlight the most informative features for
distinguishing between cell types.
The process starts by handling each cell type as a distinct slice of
the data, allowing for a more detailed analysis. When
multiple_cond_means
is set to TRUE
, the
function refines this by computing conditional means for each cell type,
using clustering to capture finer distinctions within each type. It then
performs Singular Value Decomposition (SVD) on these refined means to
uncover the principal directions that most effectively differentiate the
cell types.
Ultimately, the function provides projections that reveal how cell types and their subtypes relate to each other in a reduced space, facilitating a deeper understanding of cellular variability and distinctions in complex datasets.
# Calculate the SIR space projections
sir_output <- calculateSIRSpace(
reference_data = reference_data,
query_data = query_data,
query_cell_type_col = "SingleR_annotation",
ref_cell_type_col = "expert_annotation",
multiple_cond_means = TRUE
)
# Plot the SIR space projections
plot(sir_output,
plot_type = "boxplot",
sir_subset = 1:3)
In this example, we compute the SIR space projections using the
calculateSIRSpace()
function, and then we visualize the
projections with a boxplot to analyze the distribution of the data in
the SIR space. This example helps you understand how different cell
types and their subtypes are projected into a common space, facilitating
their comparison and interpretation.
The plot below visualizes the contribution of different markers to the variance along the SIR dimensions, helping identify which markers are most influential in differentiating e.g. between B cells (including plasma cells) from other cell types.
The plotGeneExpressionDimred()
function plots the
expression of a specific gene on a dimensional reduction plot. This
function is useful for exploring how gene expression varies across cells
in the context of reduced dimensions, such as those derived from PCA,
t-SNE, or UMAP. Each point on the plot represents a single cell,
color-coded according to the expression level of the specified gene.
Let’s visualize the expression of the gene VPREB3 on a PCA plot:
# Visualize VPREB3 gene on a PCA plot
plotGeneExpressionDimred(se_object = query_data,
method = "PCA",
pc_subset = 1:3,
feature = "VPREB3")
In this example, we visualize the expression of the
VPREB3 gene on a PCA plot derived from the
query_data
dataset. The method
argument
specifies the dimensional reduction technique, which in this case is
“PCA.” The pc_subset
argument allows us to focus on
specific principal components (PCs), here selecting the first five. The
feature
argument is the name of the gene we want to
visualize.
When executed, this function generates a scatter plot with cells colored by their expression levels of VPREB3. The resulting plot can help identify clusters of cells with similar expression patterns or highlight how expression varies across the dataset.
You can also visualize gene expression using other dimensional reduction methods, such as t-SNE or UMAP.
The plotMarkerExpression()
function compares the
distribution of a specific gene’s expression across an overall dataset
and within specific cell types. It generates density plots to visualize
these distributions, providing a graphical means of assessing the
similarity between reference and query datasets. The function also
applies a z-rank normalization to make expression values comparable
between datasets.
plotMarkerExpression(
reference_data = reference_data,
query_data = query_data,
ref_cell_type_col = "expert_annotation",
query_cell_type_col = "SingleR_annotation",
gene_name = "VPREB3",
cell_type = "B_and_plasma"
)
In this example, the function compares the expression distribution of
the VPREB3 gene between the reference and query
datasets. The ref_cell_type_col
and
query_cell_type_col
arguments specify the columns in
colData
that identify the cell types in the reference and
query datasets, respectively. The cell_type
argument
specifies the cell type of interest, and gene_name
is the
gene whose expression distribution we wish to compare.
When executed, this function generates two density plots: one showing the overall gene expression distribution across all cells and another focusing on the specified cell type. These plots help identify whether the gene’s expression levels are similar between the reference and query datasets, providing insight into dataset alignment.
The scatter plot visualizes the relationship between QC statistics (e.g., total library size or percentage of mitochondrial genes) and cell type annotation scores. This plot helps in understanding how QC metrics influence or correlate with the predicted cell types.
# Generate scatter plot
p1 <- plotQCvsAnnotation(se_object = qc_data,
cell_type_col = "SingleR_annotation",
qc_col = "total",
score_col = "annotation_scores")
p1 + ggplot2::xlab("Library Size")
A scatter plot can reveal patterns such as whether cells with higher library sizes or mitochondrial content tend to be associated with specific annotations. For instance, cells with unusually high mitochondrial content might be identified as low-quality or stressed, potentially affecting their annotations.
Histograms provide a distribution view of QC metrics and annotation scores. They help in evaluating the range, central tendency, and spread of these variables across cells.
# Generate histograms
histQCvsAnnotation(se_object = query_data,
cell_type_col = "SingleR_annotation",
qc_col = "percent_mito",
score_col = "annotation_scores")
Histograms are useful for assessing the overall distribution of QC metrics and annotation scores. For example, if the majority of cells have high percent_mito, it might indicate that many cells are stressed or dying, which could impact the quality of the data.
Dimensional reduction plots (PCA, t-SNE, UMAP) are used to visualize the relationships between cells in reduced dimensions. Overlaying gene set scores on these plots provides insights into how specific gene activities are distributed across cell clusters.
# Plot gene set scores on PCA
plotGeneSetScores(se_object = query_data,
method = "PCA",
score_col = "gene_set_scores",
pc_subset = 1:3)
By visualizing gene set scores on PCA or UMAP plots, one can identify clusters of cells with high or low gene set activities. This can help in understanding the biological relevance of different gene sets or pathways in various cell states or types.
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scater_1.35.0 ggplot2_3.5.1
[3] scran_1.35.0 scuttle_1.17.0
[5] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
[7] Biobase_2.67.0 GenomicRanges_1.59.1
[9] GenomeInfoDb_1.43.2 IRanges_2.41.1
[11] S4Vectors_0.45.2 BiocGenerics_0.53.3
[13] generics_0.1.3 MatrixGenerics_1.19.0
[15] matrixStats_1.4.1 scDiagnostics_1.1.0
[17] BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] DBI_1.2.3 gridExtra_2.3 cramer_0.9-4
[4] rlang_1.1.4 magrittr_2.0.3 ggridges_0.5.6
[7] compiler_4.4.2 systemfonts_1.1.0 vctrs_0.6.5
[10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
[13] XVector_0.47.0 labeling_0.4.3 utf8_1.2.4
[16] rmarkdown_2.29 UCSC.utils_1.3.0 ggbeeswarm_0.7.2
[19] ragg_1.3.3 xfun_0.49 bluster_1.17.0
[22] zlibbioc_1.52.0 cachem_1.1.0 beachmat_2.23.2
[25] jsonlite_1.8.9 DelayedArray_0.33.2 BiocParallel_1.41.0
[28] irlba_2.3.5.1 parallel_4.4.2 cluster_2.1.6
[31] biglm_0.9-3 R6_2.5.1 bslib_0.8.0
[34] ranger_0.17.0 limma_3.63.2 boot_1.3-31
[37] jquerylib_0.1.4 Rcpp_1.0.13-1 knitr_1.49
[40] Matrix_1.7-1 igraph_2.1.1 tidyselect_1.2.1
[43] abind_1.4-8 yaml_2.3.10 viridis_0.6.5
[46] codetools_0.2-20 lattice_0.22-6 tibble_3.2.1
[49] withr_3.0.2 evaluate_1.0.1 pillar_1.9.0
[52] BiocManager_1.30.25 munsell_0.5.1 scales_1.3.0
[55] glue_1.8.0 metapod_1.15.0 maketools_1.3.1
[58] tools_4.4.2 speedglm_0.3-5 BiocNeighbors_2.1.1
[61] sys_3.4.3 data.table_1.16.2 ScaledMatrix_1.15.0
[64] locfit_1.5-9.10 buildtools_1.0.0 grid_4.4.2
[67] edgeR_4.5.0 colorspace_2.1-1 patchwork_1.3.0
[70] GenomeInfoDbData_1.2.13 beeswarm_0.4.0 BiocSingular_1.23.0
[73] vipor_0.4.7 cli_3.6.3 rsvd_1.0.5
[76] textshaping_0.4.0 fansi_1.0.6 S4Arrays_1.7.1
[79] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.6
[82] isotree_0.6.1-1 sass_0.4.9 digest_0.6.37
[85] SparseArray_1.7.2 ggrepel_0.9.6 dqrng_0.4.1
[88] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
[91] httr_1.4.7 statmod_1.5.0 transport_0.15-4
[94] MASS_7.3-61