This vignette will review the steps needed to implement the
GloScope
methodology. GloScope
is a framework
for creating profiles of scRNA-Seq samples in order to globally compare
and analyze them across patients or tissue samples. First this
methodology estimates a global gene expression distribution for each
sample, and then it calculates how divergent pairs of samples are from
each other. The output from the package’s main function,
gloscope()
, is a n × n divergence matrix
containing the pairwise statistical divergences between all samples.
This divergence matrix can be the input to other downstream statistical
and machine learning tools. This package has been submitted to
Bioconductor because we believe it may be of particular interest to the
bioinformatic community and because it depends on other packages from
Bioconductor.
GloScope
estimates the gene expression density from a
low dimensional embedding of the UMI count measurements, such as PCA or
scVI embeddings. Users must provide GloScope
with a
data.frame
containing each cell’s embedding coordinates,
along with the metadata which identifies to which sample each cell
belongs to.
The package provides a helper function plotMDS
which
allows the user to visualize the divergence matrix with multidimensional
scaling (MDS), but the divergence matrix can also be visualized with
other algorithms as well.
A standard workflow for GloScope
consists of:
Obtain the dimension reduction embedding of the cells and specify
how many dimensions to keep. This is computed outside of the
GloScope
package.
Choose a density estimation method (Gaussian mixture or k-nearest neighbours) to estimate each sample’s latent distribution.
Calculate the symmetric KL divergence or Jensen-Shannon divergence between all pairs of samples.
Visualize the distance matrix with the first two dimensions of
MDS using the plotMDS
function.
In this section, we use a toy example to illustrate the input, output
and visualization steps of the GloScope
pipeline.
The data is a subset of that presented in Stephenson et al. (2021), and was obtained from
this
hyperlinked URL. The dataset has a total of 647,366 peripheral blood
mononuclear cells (PBMCs) from 130 patients, whose phenotypes are
COVID-infected, healthy control donor, patients with other non-COVID
respiratory disease, and volunteers administered with intravenous
lipopolysaccharide (IV-LPS). To enable faster computation in this
tutorial, we subset this dataset to 20 random COVID and healthy control
donors and further subsample each patient’s count matrix to 500 random
cells. This subsampled data is provided in the SingleCellExperiment
object example_SCE
. We emphasize that this subsampling
procedure is only for demonstration purposes and is not a recommended
step in normal analyses.
We first load the GloScope
package and the
aforementioned example dataset.
The example SingleCellExperiment object contains the first 50 principal components of the subsampled cells, as well as the sample and phenotype labels associated with each cell.
head(SingleCellExperiment::reducedDim(example_SCE,"PCA")[,1:10])
#> PC_1 PC_2 PC_3 PC_4
#> BGCV03_GCGCCAATCTGGCGTG-1 -4.677964 0.26785278 -2.68663740 1.5347402
#> BGCV03_CGTCACTTCCCAAGAT-1 -2.877392 0.04509136 12.38307762 4.5973921
#> BGCV15_TGAGGGACATCGACGC-1 -3.625502 -0.05260641 1.15989053 -5.6360350
#> BGCV03_GATCAGTAGTTACGGG-1 13.487988 -7.37058783 0.77435076 -0.8953347
#> BGCV03_TGCGGGTCACGGACAA-1 -3.740279 0.27755079 0.07352045 -4.6179056
#> BGCV03_CATGACAGTCCAGTAT-1 -2.349224 2.82407236 0.93374163 -4.7708669
#> PC_5 PC_6 PC_7 PC_8
#> BGCV03_GCGCCAATCTGGCGTG-1 0.2855917 -0.39382035 -0.09173895 -1.4540112
#> BGCV03_CGTCACTTCCCAAGAT-1 -3.2936299 -0.07147141 -1.95012903 -1.1909559
#> BGCV15_TGAGGGACATCGACGC-1 -3.1034510 2.69858479 0.35455966 -0.7677660
#> BGCV03_GATCAGTAGTTACGGG-1 -7.4018416 5.53588390 -3.99845910 2.1559813
#> BGCV03_TGCGGGTCACGGACAA-1 -2.6745775 1.97423577 -0.32567784 -0.3111276
#> BGCV03_CATGACAGTCCAGTAT-1 -0.5981533 0.95109445 -1.91192818 1.6709988
#> PC_9 PC_10
#> BGCV03_GCGCCAATCTGGCGTG-1 -0.6873072 3.0339348
#> BGCV03_CGTCACTTCCCAAGAT-1 0.9220569 -1.0549433
#> BGCV15_TGAGGGACATCGACGC-1 0.2317176 -1.9861494
#> BGCV03_GATCAGTAGTTACGGG-1 -2.9655695 3.9789870
#> BGCV03_TGCGGGTCACGGACAA-1 0.4197560 -0.8079447
#> BGCV03_CATGACAGTCCAGTAT-1 -1.9040219 -2.5736504
head(SingleCellExperiment::colData(example_SCE))
#> DataFrame with 6 rows and 3 columns
#> sample_id phenotype cluster_id
#> <factor> <factor> <factor>
#> BGCV03_GCGCCAATCTGGCGTG-1 CV0176 Covid CD8
#> BGCV03_CGTCACTTCCCAAGAT-1 CV0176 Covid B_cell
#> BGCV15_TGAGGGACATCGACGC-1 CV0176 Covid CD4
#> BGCV03_GATCAGTAGTTACGGG-1 CV0176 Covid CD14
#> BGCV03_TGCGGGTCACGGACAA-1 CV0176 Covid CD8
#> BGCV03_CATGACAGTCCAGTAT-1 CV0176 Covid RBC
The following table confirms that each donor provides 500 cells, and that both phenotypes are represented.
table(SingleCellExperiment::colData(example_SCE)$sample_id, SingleCellExperiment::colData(example_SCE)$phenotype)
#>
#> Covid Healthy LPS Non_covid
#> AP1 0 0 0 0
#> AP2 0 0 0 0
#> AP3 0 0 0 0
#> AP4 0 0 0 0
#> AP5 0 0 0 0
#> AP6 0 0 0 0
#> AP8 0 0 0 0
#> AP9 0 0 0 0
#> AP10 0 0 0 0
#> AP11 0 0 0 0
#> AP12 0 0 0 0
#> CV0025 500 0 0 0
#> CV0037 500 0 0 0
#> CV0050 500 0 0 0
#> CV0052 0 0 0 0
#> CV0058 500 0 0 0
#> CV0059 0 0 0 0
#> CV0062 0 0 0 0
#> CV0068 0 0 0 0
#> CV0073 0 0 0 0
#> CV0074 0 0 0 0
#> CV0084 0 0 0 0
#> CV0094 0 0 0 0
#> CV0100 500 0 0 0
#> CV0104 0 0 0 0
#> CV0120 500 0 0 0
#> CV0128 500 0 0 0
#> CV0134 0 0 0 0
#> CV0137 500 0 0 0
#> CV0144 500 0 0 0
#> CV0155 500 0 0 0
#> CV0160 500 0 0 0
#> CV0164 0 0 0 0
#> CV0171 0 0 0 0
#> CV0176 500 0 0 0
#> CV0178 0 0 0 0
#> CV0180 500 0 0 0
#> CV0198 0 0 0 0
#> CV0200 0 0 0 0
#> CV0201 0 0 0 0
#> CV0231 0 0 0 0
#> CV0234 500 0 0 0
#> CV0257 500 0 0 0
#> CV0262 0 0 0 0
#> CV0279 500 0 0 0
#> CV0284 0 0 0 0
#> CV0326 0 0 0 0
#> CV0902 0 0 0 0
#> CV0904 0 0 0 0
#> CV0911 0 0 0 0
#> CV0915 0 500 0 0
#> CV0917 0 500 0 0
#> CV0926 0 0 0 0
#> CV0929 0 0 0 0
#> CV0934 0 0 0 0
#> CV0939 0 0 0 0
#> CV0940 0 500 0 0
#> CV0944 0 500 0 0
#> MH8919176 0 0 0 0
#> MH8919177 0 0 0 0
#> MH8919178 0 0 0 0
#> MH8919179 0 0 0 0
#> MH8919226 0 0 0 0
#> MH8919227 0 0 0 0
#> MH8919228 0 0 0 0
#> MH8919229 0 0 0 0
#> MH8919230 0 0 0 0
#> MH8919231 0 0 0 0
#> MH8919232 0 0 0 0
#> MH8919233 0 0 0 0
#> MH8919276 0 0 0 0
#> MH8919277 0 0 0 0
#> MH8919278 0 0 0 0
#> MH8919279 0 0 0 0
#> MH8919280 0 0 0 0
#> MH8919281 0 0 0 0
#> MH8919282 0 0 0 0
#> MH8919283 0 0 0 0
#> MH8919326 0 0 0 0
#> MH8919327 0 0 0 0
#> MH8919328 0 0 0 0
#> MH8919329 0 0 0 0
#> MH8919330 0 0 0 0
#> MH8919331 0 0 0 0
#> MH8919332 0 0 0 0
#> MH8919333 0 0 0 0
#> MH9143270 0 0 0 0
#> MH9143271 0 0 0 0
#> MH9143272 0 0 0 0
#> MH9143273 0 0 0 0
#> MH9143274 0 0 0 0
#> MH9143275 0 0 0 0
#> MH9143276 0 0 0 0
#> MH9143277 0 0 0 0
#> MH9143320 0 0 0 0
#> MH9143321 0 0 0 0
#> MH9143322 0 0 0 0
#> MH9143323 0 0 0 0
#> MH9143324 0 0 0 0
#> MH9143325 0 0 0 0
#> MH9143326 0 0 0 0
#> MH9143327 0 0 0 0
#> MH9143370 0 0 0 0
#> MH9143371 0 0 0 0
#> MH9143372 0 0 0 0
#> MH9143373 0 0 0 0
#> MH9143420 0 0 0 0
#> MH9143421 0 0 0 0
#> MH9143422 0 0 0 0
#> MH9143423 0 0 0 0
#> MH9143424 0 0 0 0
#> MH9143425 0 0 0 0
#> MH9143426 0 0 0 0
#> MH9143427 0 0 0 0
#> MH9179821 0 0 0 0
#> MH9179822 0 0 0 0
#> MH9179823 0 0 0 0
#> MH9179824 0 0 0 0
#> MH9179825 0 0 0 0
#> MH9179826 0 0 0 0
#> MH9179827 0 0 0 0
#> MH9179828 0 0 0 0
#> newcastle004v2 0 0 0 0
#> newcastle20 0 0 0 0
#> newcastle21 0 0 0 0
#> newcastle21v2 0 0 0 0
#> newcastle49 0 0 0 0
#> newcastle59 0 0 0 0
#> newcastle65 0 0 0 0
#> newcastle74 0 0 0 0
GloScope
expects that the user provides
GloScope
with a data.frame
containing each
cell’s low-dimensional embedding (with cells in rows), along with a
vector which contains the sample from which each cell in the embedding
matrix is drawn. The PCA embeddings in the example dataset are provided
by the authors of Stephenson et al.
(2021). In general, users of GloScope
can input any
dimensionality reduction to the method. UMI counts are often provided as
Seurat
or SingleCellExperiment
objects, and
many dimensionality reduction strategies, including PCA and scVI (Lopez et al. (2018)) can be computed and saved
within those data structures. This is recommended, as it will allow the
user to keep the counts, embeddings, and the meta data (like the sample
from which each cell was isolated) in the same structure and will
minimize the change of inadvertently mislabelling the sample of a
cell.
The following code, which is only pseudo-code and not evaluated
(since we do not provide the seurat_object
), demonstrates
how PCA embeddings and sample labels can be extracted from a
Seurat
object for input into the gloscope
function.
With the cell embeddings and the sample labels in the proper format,
the gloscope
function is simple to setup and run. For
simplicity, we will save these as separate objects, though for real
datasets, this would not be recommended since it would unnecessarily
make copies of the data and increase memory usage:
embedding_matrix <- SingleCellExperiment::reducedDim(example_SCE,"PCA")[,1:10]
sample_ids <- SingleCellExperiment::colData(example_SCE)$sample_id
Although the example data contains the first 50 principal components, we chose to use only the first 10 for calculations. Large number of latent variables will make the density estimation unstable, so we do not recommend large increases to the number of latent variables.
The base function call is
gloscope(embedding_df, sample_ids)
.
The default implementation, run above, implements the
GMM
option for density estimation; this is the method
primarily considered in the manuscript, but can take longer to run so we
haven’t evaluated it here. (You can evaluate it by changing
eval=FALSE
to eval=TRUE
).
An alternative methods uses a non-parametric alternative for density
estimation based on a k-nearest neighbours algorithm and
can be chosen with the argument dens
. We will use this on
our examples simply to make the tutorial run quickly:
knn_divergence <- gloscope(embedding_matrix, sample_ids, dens="KNN")
knn_divergence[1:5,1:5]
#> CV0176 CV0257 CV0279 CV0120 CV0917
#> CV0176 0.000000 2.388813 1.893389 2.024262 2.333317
#> CV0257 2.388813 0.000000 1.608448 3.048070 4.432494
#> CV0279 1.893389 1.608448 0.000000 2.414725 1.690126
#> CV0120 2.024262 3.048070 2.414725 0.000000 2.465142
#> CV0917 2.333317 4.432494 1.690126 2.465142 0.000000
Note, that unlike PCA, not every dimensionality reduction method retains its statistical properties when only a subset of the coordinates is retained, with scVI being an example. For methods like scVI, you should choose the number of latent variables you will want to use when calculating the latent variables and not subset them after the fact.
If the user chooses the default GMM method (dens="GMM"
),
gloscope
fits sample-level densities with Gaussian mixture
models (GMMs) implemented by mclust
. The
mclust
package uses the Bayesian information criterion
(BIC) to select a GMM from a family of models, and the user of
GloScope
can specify how many centroids should be
considered in that family. By default GMMs with 1 through 9
centroids are compared. With the num_components
optional
vector to gloscope
the user can specify the possible number
of centroids for mclust
to compute, with the one with the
best BIC value being the final choice.
When GMMs are used for density fitting, a Monte-Carlo approximation
is then used to compute the pairwise divergences from the estimated
densities. This means that the resulting estimate is stochastic, and
details about controlling the random seed appear later in this vignette.
The number of Monte-Carlo draws from the density of each sample is 10, 000 by default, and this is controlled by
the optional parameter r in
the gloscope
function.
# Can take a couple of minutes to run:
gmm_divergence_alt<-gloscope(embedding_matrix, sample_ids, dens = "GMM", num_components = c(2,4,6),r=20000)
A non-parametric alternative for density and divergence estimation is
the k-nearest neighbours
algorithm. To use this technique, the optional argument
dens="KNN"
should be set. The number of neighbors is a
hyperparameter, equal to 50 by default,
and governed by the optional argument k
. It is important to
note that negative divergences between similar samples are possible with
this density estimation choice. The gloscope
function does
not censor or round any negative values in the output matrix, leaving
that decision to the user.
The default divergence for GloScope
is the symmetric KL
divergence, but the Jensen-Shannon divergence is also implemented. This
can be controlled by setting the argument dist_mat="KL"
or
dist_mat="JS"
, respectively. One beneficial property of the
Jensen-Shannon divergence is that its square root is a proper distance
metric. Note that gloscope
returns a matrix of
untransformed divergences, and the user must take the square root of
matrix entries themselves if this is desired.
The plotMDS
function provided by this package visualizes
the output divergence matrix with multidimensional scaling. The
plotMDS
function utilizes the isoMDS
function
from the package MASS
, and then creates a scatter plot with
samples color-coded by a user-specified covariate such as phenotype
This function requires a data.frame
with the relevant
metadata at the sample level, rather than at the cell level. This can
easily be obtained by applying the unique
function from
base R to the cell-level metadata data.frame
.
pat_info <- as.data.frame(unique(SingleCellExperiment::colData(example_SCE)[,-c(3)]))
head(pat_info)
#> sample_id phenotype
#> BGCV03_GCGCCAATCTGGCGTG-1 CV0176 Covid
#> BGCV11_AGGCCGTCATCGATGT-1 CV0257 Covid
#> BGCV09_AGCTTGAGTACTTAGC-1 CV0279 Covid
#> BGCV05_CATGCCTGTGTCGCTG-1 CV0120 Covid
#> BGCV09_ATCATCTAGTGATCGG-1 CV0917 Healthy
#> BGCV08_CCACTACGTTAAGGGC-1 CV0915 Healthy
Here we plot the MDS representation with each sample color-coded by
the phenotype
variable. Note the function call returns both
a matrix of MDS embeddings and a ggplot
visualization of
the first two dimensions.
mds_result <- plotMDS(dist_mat = knn_divergence, metadata = pat_info, "sample_id","phenotype", k = 2)
mds_result$plot
Another classical way to visualize a divergence matrix is with a
heatmap. The following code demonstrates that the output of
gloscope
is easily used in plotting functions beyond the
package.
To speed-up calculations of the pair-wise divergences,
GloScope
allows for parallelizing the calculation. The
argument BPPARAM
controls the parameters of the
parallelization (see bplapply
).
The default is no parallelization, but the iteration across
sample-pairs will still via the function bplapply
. In this
case (i.e. no parallelization), the argument is simply
BPPARAM = BiocParallel::SerialParam()
.
IMPORTANT: Due to the construction of the NAMESPACE file, it is
essential that any setting of the BPPARAM optional argument uses the
BiocParallel::
namespace prefix. For example,
gloscope(...,BPPARAM = MulicoreParam()
will raise an error,
and gloscope(...,BPPARAM = BiocParallel::MulicoreParam()
should be used instead.
The calculation of the KL divergence from the GMM density estimate
uses Monte-Carlo approximation, and hence has to randomly sample from
the estimated density. To set the seed for the pseudo-random number
generator used in the simulation, the seed needs to be set within the
argument to BPPARAM
and not by a call to
set.seed
(see https://bioconductor.org/packages/release/bioc/vignettes/BiocParallel/inst/doc/Random_Numbers.html
for more information).
This is how the seed must be set, even if there is no
parallelization chosen (the default), because the iteration over
sample pairs is sent through bplapply
function regardless,
as noted above. Setting the seed outside the function via
set.seed
will not have an effect on the function.
The following is an example of how to set the random seed when
running the GMM
option, using the default of no
parallelization:
gmm_divergence <- gloscope(embedding_matrix, sample_ids, dens = "GMM", dist_mat = "KL",
BPPARAM = BiocParallel::SerialParam(RNGseed = 2))
The same argument (RNGseed
) can be added to other
BPPARAM
arguments to set the seed.
Note that the KNN
estimation procedure does not have any
Monte-Carlo approximation steps, and thus does not need to have a random
seed.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] GloScope_1.5.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 gtable_0.3.6
#> [3] xfun_0.49 bslib_0.8.0
#> [5] ggplot2_3.5.1 Biobase_2.67.0
#> [7] lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.4.2 generics_0.1.3
#> [11] stats4_4.4.2 parallel_4.4.2
#> [13] tibble_3.2.1 fansi_1.0.6
#> [15] pkgconfig_2.0.3 Matrix_1.7-1
#> [17] S4Vectors_0.45.2 lifecycle_1.0.4
#> [19] GenomeInfoDbData_1.2.13 farver_2.1.2
#> [21] compiler_4.4.2 FNN_1.1.4.1
#> [23] munsell_0.5.1 codetools_0.2-20
#> [25] GenomeInfoDb_1.43.1 htmltools_0.5.8.1
#> [27] sys_3.4.3 buildtools_1.0.0
#> [29] sass_0.4.9 yaml_2.3.10
#> [31] pillar_1.9.0 crayon_1.5.3
#> [33] jquerylib_0.1.4 MASS_7.3-61
#> [35] BiocParallel_1.41.0 SingleCellExperiment_1.29.1
#> [37] DelayedArray_0.33.2 cachem_1.1.0
#> [39] abind_1.4-8 mclust_6.1.1
#> [41] digest_0.6.37 labeling_0.4.3
#> [43] maketools_1.3.1 fastmap_1.2.0
#> [45] grid_4.4.2 colorspace_2.1-1
#> [47] cli_3.6.3 SparseArray_1.7.2
#> [49] magrittr_2.0.3 S4Arrays_1.7.1
#> [51] utf8_1.2.4 withr_3.0.2
#> [53] UCSC.utils_1.3.0 scales_1.3.0
#> [55] rmarkdown_2.29 XVector_0.47.0
#> [57] httr_1.4.7 matrixStats_1.4.1
#> [59] RANN_2.6.2 mvnfast_0.2.8
#> [61] evaluate_1.0.1 knitr_1.49
#> [63] GenomicRanges_1.59.0 IRanges_2.41.1
#> [65] rlang_1.1.4 Rcpp_1.0.13-1
#> [67] glue_1.8.0 BiocManager_1.30.25
#> [69] BiocGenerics_0.53.3 jsonlite_1.8.9
#> [71] R6_2.5.1 MatrixGenerics_1.19.0
#> [73] zlibbioc_1.52.0