CITE-seq data with MultiAssayExperiment and MuData

Introduction

CITE-seq data provide RNA and surface protein counts for the same cells. This tutorial shows how MuData can be integrated into with Bioconductor workflows to analyse CITE-seq data.

Installation

The most recent dev build can be installed from GitHub:

library(remotes)
remotes::install_github("ilia-kats/MuData")

Stable version of MuData will be available in future bioconductor versions.

Loading libraries

library(MuData)
library(SingleCellExperiment)
library(MultiAssayExperiment)
library(SingleCellMultiModal)
library(scater)

library(rhdf5)

Loading data

We will use CITE-seq data accessible with the SingleCellMultiModal Bioconductor package, which was originally described in Stoeckius et al., 2017.

mae <- CITEseq(
    DataType="cord_blood", modes="*", dry.run=FALSE, version="1.0.0"
)
#> Dataset: cord_blood
#> Working on: scADT_Counts
#> Working on: scRNAseq_Counts
#> Working on: coldata_scRNAseq
#> Working on: scADT_clrCounts
#> see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> see ?SingleCellMultiModal and browseVignettes('SingleCellMultiModal') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Warning: 'ExperimentList' contains 'data.frame' or 'DataFrame',
#>   potential for errors with mixed data types
#> harmonizing input:
#>   removing 2277 sampleMap rows with 'primary' not in colData

mae
#> A MultiAssayExperiment object of 3 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 3:
#>  [1] scADT: matrix with 13 rows and 7858 columns
#>  [2] scADT_clr: matrix with 13 rows and 7858 columns
#>  [3] scRNAseq: matrix with 36280 rows and 7858 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

We see two modalities in the object — scRNAseq and scADT, the latter providing counts for antibody-derived tags. Notably, each experiment is a matrix.

Processing ADT data

While CITE-seq analysis workflows such as CiteFuse should be consulted for more details, below we exemplify simple data transformation in order to demonstrate how their output can be saved to an H5MU file later on.

For ADT counts, we will apply CLR transformation following Hao et al., 2020:

# Define CLR transformation as in the Seurat workflow
clr <- function(data) t(
  apply(data, 1, function(x) log1p(
    x / (exp(sum(log1p(x[x > 0]), na.rm = TRUE) / length(x)))
  ))
)

We will make the ADT modality a SingleCellExperiment object and add an assay with CLR-transformed counts:

adt_counts <- mae[["scADT"]]

mae[["scADT"]] <- SingleCellExperiment(adt_counts)
assay(mae[["scADT"]], "clr") <- clr(adt_counts)

We will also generate reduced dimensions taking advantage of the functionality in the scater package:

mae[["scADT"]] <- runPCA(
  mae[["scADT"]], exprs_values = "clr", ncomponents = 20
)
#> Warning in check_numbers(k = k, nu = nu, nv = nv, limit = min(dim(x)) - : more
#> singular values/vectors requested than available
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : You're computing too large a percentage of total singular values, use a
#> standard svd instead.
#> Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
#> TRUE, : did not converge--results might be invalid!; try increasing work or
#> maxit
plotReducedDim(mae[["scADT"]], dimred = "PCA",
               by_exprs_values = "clr", colour_by = "CD3")

plotReducedDim(mae[["scADT"]], dimred = "PCA",
               by_exprs_values = "clr", colour_by = "CD14")

Writing H5MU files

We can write the contents of the MultiAssayExperiment object into an H5MU file:

writeH5MU(mae, "cord_blood_citeseq.h5mu")

We can check that both modalities were written to the file, whether it was a matrix for RNA or SingleCellExperiment for ADT:

h5 <- rhdf5::H5Fopen("cord_blood_citeseq.h5mu")
h5ls(H5Gopen(h5, "mod"), recursive = FALSE)
#>   group      name     otype dclass dim
#> 0     /     scADT H5I_GROUP           
#> 1     / scADT_clr H5I_GROUP           
#> 2     /  scRNAseq H5I_GROUP

… both assays for ADT — raw counts are stored in X and CLR-transformed counts are in the corresponding layer:

h5ls(H5Gopen(h5, "mod/scADT"), recursive = FALSE)
#>   group   name       otype  dclass       dim
#> 0     /      X H5I_DATASET INTEGER 13 x 7858
#> 1     / layers   H5I_GROUP                  
#> 2     /    obs   H5I_GROUP                  
#> 3     /   obsm   H5I_GROUP                  
#> 4     /    var   H5I_GROUP
h5ls(H5Gopen(h5, "mod/scADT/layers"), recursive = FALSE)
#>   group name       otype dclass       dim
#> 0     /  clr H5I_DATASET  FLOAT 13 x 7858

… as well as reduced dimensions (PCA):

h5ls(H5Gopen(h5, "mod/scADT/obsm"), recursive = FALSE)
#>   group name       otype dclass       dim
#> 0     /  PCA H5I_DATASET  FLOAT 12 x 7858
# There is an alternative way to access groups:
# h5&'mod'&'scADT'&'obsm'
rhdf5::H5close()

References

  • Muon: multimodal omics analysis framework preprint

  • mudata (Python) documentation

  • muon documentation and web page

  • Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P.K., Swerdlow, H., Satija, R. and Smibert, P., 2017. Simultaneous epitope and transcriptome measurement in single cells. Nature methods, 14(9), pp.865-868.

  • Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W.M., Zheng, S., Butler, A., Lee, M.J., Wilk, A.J., Darby, C., Zager, M. and Hoffman, P., 2021. Integrated analysis of multimodal single-cell data. Cell.

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellMultiModal_1.18.0 scater_1.35.0              
#>  [3] ggplot2_3.5.1               scuttle_1.17.0             
#>  [5] CiteFuse_1.19.0             MultiAssayExperiment_1.33.1
#>  [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [9] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [11] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#> [13] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [15] MuData_1.11.0               rhdf5_2.51.0               
#> [17] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [19] generics_0.1.3              Matrix_1.7-1               
#> [21] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2            filelock_1.0.3           tibble_3.2.1            
#>   [4] polyclip_1.10-7          lifecycle_1.0.4          edgeR_4.5.1             
#>   [7] lattice_0.22-6           MASS_7.3-61              magrittr_2.0.3          
#>  [10] limma_3.63.2             plotly_4.10.4            sass_0.4.9              
#>  [13] rmarkdown_2.29           jquerylib_0.1.4          yaml_2.3.10             
#>  [16] metapod_1.15.0           cowplot_1.1.3            bayesm_3.1-6            
#>  [19] DBI_1.2.3                buildtools_1.0.0         RColorBrewer_1.1-3      
#>  [22] abind_1.4-8              zlibbioc_1.52.0          Rtsne_0.17              
#>  [25] purrr_1.0.2              mixtools_2.0.0           ggraph_2.2.1            
#>  [28] tensorA_0.36.2.1         tweenr_2.0.3             rappdirs_0.3.3          
#>  [31] GenomeInfoDbData_1.2.13  ggrepel_0.9.6            irlba_2.3.5.1           
#>  [34] maketools_1.3.1          pheatmap_1.0.12          dqrng_0.4.1             
#>  [37] codetools_0.2-20         DelayedArray_0.33.3      ggforce_0.4.2           
#>  [40] tidyselect_1.2.1         UCSC.utils_1.3.0         farver_2.1.2            
#>  [43] ScaledMatrix_1.15.0      viridis_0.6.5            BiocFileCache_2.15.0    
#>  [46] jsonlite_1.8.9           BiocNeighbors_2.1.1      tidygraph_1.3.1         
#>  [49] ggridges_0.5.6           survival_3.7-0           dbscan_1.2-0            
#>  [52] segmented_2.1-3          tools_4.4.2              Rcpp_1.0.13-1           
#>  [55] glue_1.8.0               BiocBaseUtils_1.9.0      gridExtra_2.3           
#>  [58] SparseArray_1.7.2        xfun_0.49                dplyr_1.1.4             
#>  [61] withr_3.0.2              BiocManager_1.30.25      fastmap_1.2.0           
#>  [64] rhdf5filters_1.19.0      bluster_1.17.0           fansi_1.0.6             
#>  [67] digest_0.6.37            rsvd_1.0.5               mime_0.12               
#>  [70] R6_2.5.1                 colorspace_2.1-1         RSQLite_2.3.9           
#>  [73] utf8_1.2.4               tidyr_1.3.1              data.table_1.16.2       
#>  [76] robustbase_0.99-4-1      graphlayouts_1.2.1       httr_1.4.7              
#>  [79] htmlwidgets_1.6.4        S4Arrays_1.7.1           uwot_0.2.2              
#>  [82] pkgconfig_2.0.3          gtable_0.3.6             blob_1.2.4              
#>  [85] XVector_0.47.0           sys_3.4.3                htmltools_0.5.8.1       
#>  [88] scales_1.3.0             png_0.1-8                SpatialExperiment_1.17.0
#>  [91] scran_1.35.0             knitr_1.49               rjson_0.2.23            
#>  [94] reshape2_1.4.4           nlme_3.1-166             curl_6.0.1              
#>  [97] cachem_1.1.0             stringr_1.5.1            BiocVersion_3.21.1      
#> [100] parallel_4.4.2           vipor_0.4.7              AnnotationDbi_1.69.0    
#> [103] pillar_1.9.0             grid_4.4.2               vctrs_0.6.5             
#> [106] randomForest_4.7-1.2     BiocSingular_1.23.0      dbplyr_2.5.0            
#> [109] beachmat_2.23.3          cluster_2.1.6            beeswarm_0.4.0          
#> [112] evaluate_1.0.1           magick_2.8.5             cli_3.6.3               
#> [115] locfit_1.5-9.10          compiler_4.4.2           rlang_1.1.4             
#> [118] crayon_1.5.3             labeling_0.4.3           plyr_1.8.9              
#> [121] ggbeeswarm_0.7.2         stringi_1.8.4            viridisLite_0.4.2       
#> [124] BiocParallel_1.41.0      munsell_0.5.1            Biostrings_2.75.1       
#> [127] lazyeval_0.2.2           compositions_2.0-8       ExperimentHub_2.15.0    
#> [130] bit64_4.5.2              Rhdf5lib_1.29.0          KEGGREST_1.47.0         
#> [133] statmod_1.5.0            AnnotationHub_3.15.0     kernlab_0.9-33          
#> [136] igraph_2.1.1             memoise_2.0.1            bslib_0.8.0             
#> [139] DEoptimR_1.1-3-1         bit_4.5.0.1