library(miloR)
library(SingleCellExperiment)
library(scater)
library(scran)
library(dplyr)
library(patchwork)
Milo is a tool for analysis of complex single cell datasets generated from replicated multi-condition experiments, which detects changes in composition between conditions. While differential abundance (DA) is commonly quantified in discrete cell clusters, Milo uses partially overlapping neighbourhoods of cells on a KNN graph. Starting from a graph that faithfully recapitulates the biology of the cell population, Milo analysis consists of 3 steps:
In this vignette we will elaborate on how these steps are implemented
in the miloR
package.
For this demo we will use a synthetic dataset simulating a developmental trajectory, generated using dyntoy.
data("sim_trajectory", package = "miloR")
## Extract SingleCellExperiment object
traj_sce <- sim_trajectory[['SCE']]
## Extract sample metadata to use for testing
traj_meta <- sim_trajectory[["meta"]]
## Add metadata to colData slot
colData(traj_sce) <- DataFrame(traj_meta)
colnames(traj_sce) <- colData(traj_sce)$cell_id
redim <- reducedDim(traj_sce, "PCA")
dimnames(redim) <- list(colnames(traj_sce), paste0("PC", c(1:50)))
reducedDim(traj_sce, "PCA") <- redim
For DA analysis we need to construct an undirected KNN graph of
single-cells. Standard single-cell analysis pipelines usually do this
from distances in PCA. We normalize and calculate principal components
using scater
. I also run UMAP for visualization
purposes.
For differential abundance analysis on graph neighbourhoods we first
construct a Milo
object. This extends the SingleCellExperiment
class to store information about neighbourhoods on the KNN graph.
The Milo
constructor takes as input a
SingleCellExperiment
object.
## class: Milo
## dim: 500 500
## metadata(0):
## assays(2): counts logcounts
## rownames(500): G1 G2 ... G499 G500
## rowData names(0):
## colnames(500): C1 C2 ... C499 C500
## colData names(5): cell_id group_id Condition Replicate Sample
## reducedDimNames(2): PCA UMAP
## mainExpName: NULL
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1
We can use the zellkonverter
package to make a SingleCellExperiment
object from an
AnnData
object stored as h5ad
file.
We need to add the KNN graph to the Milo object. This is stored in
the graph
slot, in igraph
format. The
miloR
package includes functionality to build and store the
graph from the PCA dimensions stored in the reducedDim
slot.
## Constructing kNN graph with k:10
In progress: we are perfecting the functionality to
add a precomputed KNN graph (for example constructed with Seurat or
scanpy) to the graph
slot using the adjacency matrix.
We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.
For sampling you need to define a few parameters:
prop
: the proportion of cells to randomly sample to
start with (usually 0.1 - 0.2 is sufficient)k
: the k to use for KNN refinement (we recommend using
the same k used for KNN graph building)d
: the number of reduced dimensions to use for KNN
refinement (we recommend using the same d used for KNN graph
building)refined
indicated whether you want to use the sampling
refinement algorithm, or just pick cells at random. The default and
recommended way to go is to use refinement. The only situation in which
you might consider using random instead, is if you have batch corrected
your data with a graph based correction algorithm, such as BBKNN, but the results of
DA testing will be suboptimal.## Checking valid object
## Running refined sampling with reduced_dim
Once we have defined neighbourhoods, it’s good to take a look at how
big the neighbourhoods are (i.e. how many cells form each
neighbourhood). This affects the power of DA testing. We can check this
out using the plotNhoodSizeHist
function. Empirically, we
found it’s best to have a distribution peaking between 50 and 100.
Otherwise you might consider rerunning makeNhoods
increasing k
and/or prop
(here the
distribution looks ludicrous because it’s a small dataset).
Now we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.
## Checking meta.data validity
## Counting cells in neighbourhoods
This adds to the Milo
object a n \times m
matrix, where n is the number of neighbourhoods and m is the number of experimental
samples. Values indicate the number of cells from each sample counted in
a neighbourhood. This count matrix will be used for DA testing.
## 6 x 6 sparse Matrix of class "dgCMatrix"
## B_R1 A_R1 A_R2 B_R2 B_R3 A_R3
## 1 9 6 7 7 6 10
## 2 9 3 3 11 11 8
## 3 15 . 1 22 29 1
## 4 6 3 5 5 7 9
## 5 3 2 1 5 4 7
## 6 12 1 1 9 12 .
Now we are all set to test for differential abundance in
neighbourhoods. We implement this hypothesis testing in a generalized
linear model (GLM) framework, specifically using the Negative Binomial
GLM implementation in edgeR
.
We first need to think about our experimental design. The design
matrix should match samples to a condition of interest. In this case the
Condition
is the covariate we are going to test for.
traj_design <- data.frame(colData(traj_milo))[,c("Sample", "Condition")]
traj_design <- distinct(traj_design)
rownames(traj_design) <- traj_design$Sample
## Reorder rownames to match columns of nhoodCounts(milo)
traj_design <- traj_design[colnames(nhoodCounts(traj_milo)), , drop=FALSE]
traj_design
## Sample Condition
## B_R1 B_R1 B
## A_R1 A_R1 A
## A_R2 A_R2 A
## B_R2 B_R2 B
## B_R3 B_R3 B
## A_R3 A_R3 A
Milo uses an adaptation of the Spatial FDR correction introduced by cydar, which accounts for the overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object.
## 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
## Use 'as(., "CsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
Now we can do the test, explicitly defining our experimental design.
rownames(traj_design) <- traj_design$Sample
da_results <- testNhoods(traj_milo, design = ~ Condition, design.df = traj_design)
## Using TMM normalisation
## Running with model contrasts
## Performing spatial FDR correction with k-distance weighting
This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates whether there is significant differential abundance between conditions.
## logFC logCPM F PValue FDR Nhood SpatialFDR
## 23 -0.09648099 15.12885 0.02139997 0.8839891 0.8839891 23 0.8839891
## 22 -0.30855800 15.66943 0.66366355 0.5214577 0.5431851 22 0.5406543
## 2 0.34068687 15.91731 1.01207068 0.4492811 0.4883490 2 0.4856839
## 5 -0.51995590 15.17192 0.70093856 0.4044649 0.4596192 5 0.4565402
## 20 -0.48852549 15.44205 0.91380966 0.3566490 0.4245821 20 0.4205282
## 15 0.47181324 15.99131 1.37066497 0.2846611 0.3558264 15 0.3502434
To visualize DA results, we build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding.
traj_milo <- buildNhoodGraph(traj_milo)
plotUMAP(traj_milo) + plotNhoodGraphDA(traj_milo, da_results, alpha=0.05) +
plot_layout(guides="collect")
## Adding nhood effect sizes to neighbourhood graph attributes
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MouseThymusAgeing_1.14.0 patchwork_1.3.0
## [3] dplyr_1.1.4 scran_1.35.0
## [5] scater_1.35.0 ggplot2_3.5.1
## [7] scuttle_1.17.0 SingleCellExperiment_1.29.1
## [9] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [13] IRanges_2.41.1 S4Vectors_0.45.2
## [15] BiocGenerics_0.53.3 generics_0.1.3
## [17] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [19] miloR_2.3.0 edgeR_4.5.0
## [21] limma_3.63.2 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_1.8.9
## [4] magrittr_2.0.3 ggbeeswarm_0.7.2 farver_2.1.2
## [7] rmarkdown_2.29 zlibbioc_1.52.0 vctrs_0.6.5
## [10] memoise_2.0.1 htmltools_0.5.8.1 S4Arrays_1.7.1
## [13] AnnotationHub_3.15.0 curl_6.0.1 BiocNeighbors_2.1.1
## [16] SparseArray_1.7.2 sass_0.4.9 pracma_2.4.4
## [19] bslib_0.8.0 cachem_1.1.0 buildtools_1.0.0
## [22] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
## [25] pkgconfig_2.0.3 rsvd_1.0.5 Matrix_1.7-1
## [28] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [31] digest_0.6.37 numDeriv_2016.8-1.1 colorspace_2.1-1
## [34] AnnotationDbi_1.69.0 dqrng_0.4.1 irlba_2.3.5.1
## [37] ExperimentHub_2.15.0 RSQLite_2.3.8 beachmat_2.23.2
## [40] labeling_0.4.3 filelock_1.0.3 fansi_1.0.6
## [43] httr_1.4.7 polyclip_1.10-7 abind_1.4-8
## [46] compiler_4.4.2 bit64_4.5.2 withr_3.0.2
## [49] BiocParallel_1.41.0 viridis_0.6.5 DBI_1.2.3
## [52] ggforce_0.4.2 MASS_7.3-61 rappdirs_0.3.3
## [55] DelayedArray_0.33.2 bluster_1.17.0 gtools_3.9.5
## [58] tools_4.4.2 vipor_0.4.7 beeswarm_0.4.0
## [61] glue_1.8.0 grid_4.4.2 cluster_2.1.6
## [64] gtable_0.3.6 tidyr_1.3.1 BiocSingular_1.23.0
## [67] tidygraph_1.3.1 ScaledMatrix_1.15.0 metapod_1.15.0
## [70] utf8_1.2.4 XVector_0.47.0 ggrepel_0.9.6
## [73] BiocVersion_3.21.1 pillar_1.9.0 stringr_1.5.1
## [76] splines_4.4.2 tweenr_2.0.3 BiocFileCache_2.15.0
## [79] lattice_0.22-6 FNN_1.1.4.1 bit_4.5.0
## [82] tidyselect_1.2.1 locfit_1.5-9.10 maketools_1.3.1
## [85] Biostrings_2.75.1 knitr_1.49 gridExtra_2.3
## [88] xfun_0.49 graphlayouts_1.2.1 statmod_1.5.0
## [91] stringi_1.8.4 UCSC.utils_1.3.0 yaml_2.3.10
## [94] evaluate_1.0.1 codetools_0.2-20 ggraph_2.2.1
## [97] tibble_3.2.1 BiocManager_1.30.25 cli_3.6.3
## [100] uwot_0.2.2 munsell_0.5.1 jquerylib_0.1.4
## [103] Rcpp_1.0.13-1 dbplyr_2.5.0 png_0.1-8
## [106] parallel_4.4.2 blob_1.2.4 viridisLite_0.4.2
## [109] scales_1.3.0 purrr_1.0.2 crayon_1.5.3
## [112] rlang_1.1.4 cowplot_1.1.3 KEGGREST_1.47.0