Utilities for multi-modal single-cell analyses

Overview

The mumosa package implements a variety of utilities for analyses of single-cell data with multiple modalities. This usually refers to single-cell RNA-seq experiments with proteomics, epigenomics or other data collected from the same cells. The aim is to investigate how different modalities relate to each other via analyses of correlations, and to combine data together from multiple modalities for integrated downstream analyses. The actual analyses of each individual modality are deferred to other packages; the scope of mumosa is limited to the sharing of information across modalities.

Setting up the dataset

To demonstrate the functionalities of this package, we will use a subset of a CITE-seq experiment from the scRNAseq package. The main Experiment contains the RNA-seq counts while the adt alternative Experiment contains the CITE-seq counts - see the SingleCellExperiment package documentation for more details.

library(scater)
library(scran)
library(scRNAseq)
sce <- KotliarovPBMCData()
sce <- sce[,1:1000] # subset for speed.
sce
## class: SingleCellExperiment 
## dim: 32738 1000 
## metadata(0):
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(0):
## colnames(1000): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGAGGCGTACA_H1B1ln1 ...
##   ATAACGCGTTTGGGCC_H1B1ln1 ATAACGCTCAACGCTA_H1B1ln1
## colData names(24): nGene nUMI ... dmx_hto_match timepoint
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): ADT

We perform some cursory analyses on the RNA component:

stats <- perCellQCMetrics(sce, subsets=list(Mito=grep("^MT-", rownames(sce))))
filter <- quickPerCellQC(stats, sub.fields="subsets_Mito_percent")
sce <- sce[,!filter$discard]
sce <- logNormCounts(sce)
dec <- modelGeneVar(sce)
set.seed(10000)
sce <- runPCA(sce, ncomponents=25, subset_row=getTopHVGs(dec, n=5000))

And again on the protein component.

# TODO: sync this with the book.
sceA <- as(altExp(sce), "SingleCellExperiment")
statsA <- perCellQCMetrics(sceA)
keep <- statsA$detected > 50 
sceA <- sceA[,keep]

library(DropletUtils)
amb <- inferAmbience(assay(sceA))
sceA <- computeMedianFactors(sceA, reference=amb)
sceA <- logNormCounts(sceA)

set.seed(10000)
sceA <- runPCA(sceA, ncomponents=15)

Finally, we put everything back into the same object, only considering the cells passing QC in both modalities.

sce2 <- sce[,keep]
altExp(sce2) <- sceA

Rescaling by neighbors

The easiest way to combine data for the same set of cells is to simply cbind their matrices together prior to downstream analyses like clustering. However, this requires some rescaling to adjust for the differences in the number of features and variation of each modality; otherwise, the modality with more features or stronger (technical) variation would just dominate later calculations. rescaleByNeighbors() quantifies the “noise” of each modality using on the median distance to the k-th nearest neighbor, and then rescales each expression/PC matrix by that distance to equalize the noise across modalities.

library(mumosa)
output <- rescaleByNeighbors(list(reducedDim(sce2), reducedDim(altExp(sce2))))
dim(output)
## [1] 911  40

The result is a cbinded matrix that can be used directly in downstream analyses like clustering and dimensionality reduction.

set.seed(100001)
library(bluster)
sce2$combined.clustering <- clusterRows(output, NNGraphParam())

reducedDim(sce2, "combined") <- output
sce2 <- runTSNE(sce2, dimred="combined")
plotTSNE(sce2, colour_by="combined.clustering")

Advanced users can also adjust the weight given to each modality via the weights= argument; a modality assigned a weight of 2 will mean that the rescaled distance is twice that of a modality with a weight of 1.

Intersecting clusters

In other situations, we may already have a clustering for each modality. For example, we might have one clustering for each of the gene expression and ADT PCs:

g.rna <- buildSNNGraph(sce2, use.dimred="PCA")
rna.clusters <- igraph::cluster_walktrap(g.rna)$membership
table(rna.clusters)
## rna.clusters
##   1   2   3   4   5   6   7 
## 115 218  24  84  92 254 124
g.adt <- buildSNNGraph(altExp(sce2), use.dimred='PCA')
adt.clusters <- igraph::cluster_walktrap(g.adt)$membership
table(adt.clusters)
## adt.clusters
##   1   2   3   4   5   6   7   8 
## 111 191 122 181 122  82  20  82

We can naively “intersect” these clusters by concatenating their identities, as shown below. Cells are only assigned to the same intersected clusters if they belong to the same per-modality clusters. In practice, this often creates a large number of irrelevant clusters with small numbers of cells.

intersected <- paste0(adt.clusters, ".", rna.clusters)
table(intersected)
## intersected
## 1.2 1.4 1.5 2.2 2.5 2.6 3.7 4.2 4.4 4.6 5.1 5.3 5.6 6.2 6.6 7.3 7.7 8.4 8.5 
##  39   2  70 155  20  16 122   6   2 173 115   6   1  18  64  18   2  80   2

A more refined approach is implemented in the intersectClusters() function. This performs the intersection as described but then performs a series of pairwise merges to eliminate the small clusters. It will stop merging when the within-cluster sum of squares exceeds that of the original clustering in any modality; this ensures that any within-modality structure captured by the original clustering is preserved.

intersected2 <- intersectClusters(list(rna.clusters, adt.clusters),
    list(reducedDim(sce2), reducedDim(altExp(sce2))))
table(intersected2)
## intersected2
##   1   2   3   4   5   6   7   8   9 
## 124  92  39  82  84 180 115 171  24

Multi-metric UMAP

Another approach is to take advantage of UMAP’s ability to combine information on different scales. This is achieved by - roughly speaking - creating nearest-neighbor graphs within each modality, and then “intersecting” the graphs such that cells are only considered to be neighbors if they are neighbors in each modality. In this manner, we incorporate locality information from all modalities without explicitly comparing distances. We apply this strategy using the calculateMultiUMAP() function, increasing the number of components to avoid loss of information.

set.seed(100002)
umap.out <- calculateMultiUMAP(list(reducedDim(sce2), reducedDim(altExp(sce2))), 
    n_components=20)
dim(umap.out)
## [1] 911  20

Again, we can use this in downstream analyses like clustering:

library(bluster)
sce2$umap.clustering <- clusterRows(umap.out, NNGraphParam())

And also visualization, though perhaps it is more natural to just compute the UMAP on two dimensions for this purpose. runMultiMap() is just a wrapper around calculateMultiUMAP() that conveniently stores the output in the reducedDims of the input SingleCellExperiment.

set.seed(100002)
sce2 <- runMultiUMAP(sce2, dimred="PCA", extras=list(reducedDim(altExp(sce2))))
plotReducedDim(sce2, "MultiUMAP", colour_by="umap.clustering")

Intersecting graphs

Inspired by the UMAP approach, we can perform intersections on arbitrary graphs with the intersectGraphs() function. This is relevant in cases where we have nearest-neighbor graphs constructed separately for each modality, e.g., for graph-based clustering. We intersect these graphs by setting the weights of edges to the product of the edge weights from each individual graph.

g.com <- intersectGraphs(g.rna, g.adt)

Two cells will only be connected by a high-weighted edge if that edge is of high weight in each original graph, i.e., cells must be nearest neighbors in all modalities to be considered as nearest neighbors in the intersection. We can then use this graph for clustering, which hopefully gives us clusters where cells are similar across all modalities.

com.clusters <- igraph::cluster_walktrap(g.com)$membership
table(com.clusters)
## com.clusters
##   1   2   3   4   5   6   7   8   9  10 
## 123 124 118 169 178  70  21  22  84   2

Correlation analyses

Given two modalities, we may be interested in knowing which features in one modality are correlated to the features in another modality. The computeCorrelations() function does exactly this:

cor.all <- computeCorrelations(sce2, altExp(sce2)[1:5,])
cor.all[order(cor.all$p.value),]
## DataFrame with 163690 rows and 5 columns
##           feature1      feature2       rho     p.value         FDR
##        <character>   <character> <numeric>   <numeric>   <numeric>
## 1            CD79A     BTLA_PROT  0.443223 4.01052e-45 2.89580e-40
## 2         HLA-DRB1    CD123_PROT  0.430457 2.17126e-42 7.83878e-38
## 3          HLA-DRA    CD123_PROT  0.403383 5.75021e-37 1.38398e-32
## 4             FCN1     CD13_PROT  0.397864 6.38599e-36 1.15275e-31
## 5            TCL1A     BTLA_PROT  0.383289 2.97413e-33 4.29495e-29
## ...            ...           ...       ...         ...         ...
## 163686      snoU13 AnnexinV_PROT       NaN          NA          NA
## 163687      snoU13     BTLA_PROT       NaN          NA          NA
## 163688      snoU13    CD117_PROT       NaN          NA          NA
## 163689      snoU13    CD123_PROT       NaN          NA          NA
## 163690      snoU13     CD13_PROT       NaN          NA          NA

For multi-batch experiments, we can specify blocking factors to avoid being confounded by the batch effect. Each batch receives equal weight when averaging correlations and in computing the final combined p-value.

b <- rep(1:3, length.out=ncol(sce2))
cor.all.batch <- computeCorrelations(sce2, altExp(sce2)[1:5,], block=b)

However, this approach can be very slow when dealing with large numbers of features; indeed, one may note the subsetting to the first 5 features in the code above. Users can enable parallelization via BPPARAM= to speed up the process but this will only go so far.

An alternative approach is to, for each feature in one modality, perform an approximate search for the top most-correlated features in another modality. This assumes that only the strongest correlations (positive or negative) are of actual interest, while the bulk of weak correlations can be ignored. We use the findTopCorrelations() function with a specified number of top features to extract:

set.seed(100001) # for IRLBA.
top.cor <- findTopCorrelations(sce2[1:100,], y=altExp(sce2), number=10)
top.cor
## List of length 2
## names(2): positive negative
top.cor$positive
## DataFrame with 530 rows and 5 columns
##         feature1              feature2        rho   p.value       FDR
##      <character>           <character>  <numeric> <numeric> <numeric>
## 1   RP11-34P13.7 RatIgG2bkIsotype_PROT  0.0550420 0.0484271         1
## 2   RP11-34P13.7            CD194_PROT  0.0519342 0.0586247         1
## 3   RP11-34P13.7             CD19_PROT  0.0310772 0.1743962         1
## 4   RP11-34P13.7             CD34_PROT  0.0301584 0.1816169         1
## 5   RP11-34P13.7             CD1c_PROT  0.0300619 0.1823858         1
## ...          ...                   ...        ...       ...       ...
## 526     C1orf222             CD19_PROT 0.03895679  0.120066         1
## 527     C1orf222            CD141_PROT 0.03693348  0.132724         1
## 528     C1orf222             CD10_PROT 0.02722832  0.205866         1
## 529     C1orf222            CD184_PROT 0.00819402  0.402461         1
## 530     C1orf222            CD273_PROT 0.00572678  0.431477         1

This returns the top 10 features in the CITE-seq data (y) for each feature in the main RNA-seq data. As the search is approximate, some features may not be ranked as highly as they would be under an exact search - the approximation quality can be modified by increasing the number of PCs (d) used for data compression. Blocking is also supported with block=, in which case correlations are only computed within each level of the blocking factor.

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bluster_1.17.0              mumosa_1.15.0              
##  [3] DropletUtils_1.27.0         scRNAseq_2.20.0            
##  [5] scran_1.35.0                scater_1.35.0              
##  [7] ggplot2_3.5.1               scuttle_1.17.0             
##  [9] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [11] Biobase_2.67.0              GenomicRanges_1.59.1       
## [13] GenomeInfoDb_1.43.2         IRanges_2.41.1             
## [15] S4Vectors_0.45.2            BiocGenerics_0.53.3        
## [17] generics_0.1.3              MatrixGenerics_1.19.0      
## [19] matrixStats_1.4.1           knitr_1.49                 
## [21] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] sys_3.4.3                 jsonlite_1.8.9           
##   [3] magrittr_2.0.3            ggbeeswarm_0.7.2         
##   [5] GenomicFeatures_1.59.1    gypsum_1.3.0             
##   [7] farver_2.1.2              rmarkdown_2.29           
##   [9] BiocIO_1.17.1             zlibbioc_1.52.0          
##  [11] vctrs_0.6.5               DelayedMatrixStats_1.29.0
##  [13] memoise_2.0.1             Rsamtools_2.23.1         
##  [15] RCurl_1.98-1.16           htmltools_0.5.8.1        
##  [17] S4Arrays_1.7.1            AnnotationHub_3.15.0     
##  [19] curl_6.0.1                BiocNeighbors_2.1.1      
##  [21] Rhdf5lib_1.29.0           SparseArray_1.7.2        
##  [23] rhdf5_2.51.0              sass_0.4.9               
##  [25] alabaster.base_1.7.2      bslib_0.8.0              
##  [27] alabaster.sce_1.7.0       httr2_1.0.7              
##  [29] cachem_1.1.0              ResidualMatrix_1.17.0    
##  [31] buildtools_1.0.0          GenomicAlignments_1.43.0 
##  [33] igraph_2.1.1              lifecycle_1.0.4          
##  [35] pkgconfig_2.0.3           rsvd_1.0.5               
##  [37] Matrix_1.7-1              R6_2.5.1                 
##  [39] fastmap_1.2.0             GenomeInfoDbData_1.2.13  
##  [41] digest_0.6.37             colorspace_2.1-1         
##  [43] AnnotationDbi_1.69.0      dqrng_0.4.1              
##  [45] irlba_2.3.5.1             ExperimentHub_2.15.0     
##  [47] RSQLite_2.3.8             beachmat_2.23.2          
##  [49] labeling_0.4.3            filelock_1.0.3           
##  [51] fansi_1.0.6               httr_1.4.7               
##  [53] abind_1.4-8               compiler_4.4.2           
##  [55] bit64_4.5.2               withr_3.0.2              
##  [57] BiocParallel_1.41.0       viridis_0.6.5            
##  [59] DBI_1.2.3                 R.utils_2.12.3           
##  [61] alabaster.ranges_1.7.0    HDF5Array_1.35.1         
##  [63] alabaster.schemas_1.7.0   rappdirs_0.3.3           
##  [65] DelayedArray_0.33.2       rjson_0.2.23             
##  [67] tools_4.4.2               vipor_0.4.7              
##  [69] beeswarm_0.4.0            R.oo_1.27.0              
##  [71] glue_1.8.0                batchelor_1.23.0         
##  [73] restfulr_0.0.15           rhdf5filters_1.19.0      
##  [75] grid_4.4.2                Rtsne_0.17               
##  [77] cluster_2.1.6             gtable_0.3.6             
##  [79] R.methodsS3_1.8.2         ensembldb_2.31.0         
##  [81] BiocSingular_1.23.0       ScaledMatrix_1.15.0      
##  [83] metapod_1.15.0            utf8_1.2.4               
##  [85] XVector_0.47.0            ggrepel_0.9.6            
##  [87] BiocVersion_3.21.1        pillar_1.9.0             
##  [89] limma_3.63.2              dplyr_1.1.4              
##  [91] BiocFileCache_2.15.0      lattice_0.22-6           
##  [93] FNN_1.1.4.1               rtracklayer_1.67.0       
##  [95] bit_4.5.0                 tidyselect_1.2.1         
##  [97] locfit_1.5-9.10           maketools_1.3.1          
##  [99] Biostrings_2.75.1         gridExtra_2.3            
## [101] ProtGenerics_1.39.0       edgeR_4.5.0              
## [103] xfun_0.49                 statmod_1.5.0            
## [105] UCSC.utils_1.3.0          lazyeval_0.2.2           
## [107] yaml_2.3.10               evaluate_1.0.1           
## [109] codetools_0.2-20          tibble_3.2.1             
## [111] alabaster.matrix_1.7.2    BiocManager_1.30.25      
## [113] cli_3.6.3                 uwot_0.2.2               
## [115] munsell_0.5.1             jquerylib_0.1.4          
## [117] Rcpp_1.0.13-1             dbplyr_2.5.0             
## [119] png_0.1-8                 XML_3.99-0.17            
## [121] parallel_4.4.2            blob_1.2.4               
## [123] AnnotationFilter_1.31.0   sparseMatrixStats_1.19.0 
## [125] bitops_1.0-9              alabaster.se_1.7.0       
## [127] viridisLite_0.4.2         scales_1.3.0             
## [129] crayon_1.5.3              rlang_1.1.4              
## [131] KEGGREST_1.47.0