Getting started with MuData for MultiAssayExperiment

Introduction

Multimodal data format — MuDatahas been introduced to address the need for cross-platform standard for sharing large-scale multimodal omics data. Importantly, it develops ideas of and is compatible with AnnData standard for storing raw and derived data for unimodal datasets.

In Bioconductor, multimodal datasets have been stored in MultiAssayExperiment (MAE) objects. This MuData package provides functionality to read data from MuData files into MAE objects as well as to save MAE objects into H5MU files.

Installation

The most recent dev build can be installed from GitHub:

library(remotes)
remotes::install_github("ilia-kats/MuData")

Stable version of MuData will be available in future bioconductor versions.

Loading libraries

library(MuData)
library(MultiAssayExperiment)

library(rhdf5)

Writing H5MU files

We’ll use a simple MAE object from the MultiAssayExperiment package that we’ll then save in a H5MU file.

data(miniACC)
miniACC
#> A MultiAssayExperiment object of 5 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 5:
#>  [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
#>  [2] gistict: SummarizedExperiment with 198 rows and 90 columns
#>  [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
#>  [4] Mutations: matrix with 97 rows and 90 columns
#>  [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

We will now write its contents into an H5MU file with WriteH5MU:

writeH5MU(miniACC, "miniacc.h5mu")

Reading H5MU files

We can manually check the top level of the structure of the file:

rhdf5::h5ls("miniacc.h5mu", recursive = FALSE)
#>   group   name     otype dclass dim
#> 0     /    mod H5I_GROUP           
#> 1     /    obs H5I_GROUP           
#> 2     /   obsm H5I_GROUP           
#> 3     / obsmap H5I_GROUP           
#> 4     /    uns H5I_GROUP           
#> 5     /    var H5I_GROUP

Or dig deeper into the file:

h5 <- rhdf5::H5Fopen("miniacc.h5mu")
h5&'mod'
#> HDF5 GROUP 
#>         name /mod
#>     filename 
#> 
#>              name     otype dclass dim
#> 0 Mutations       H5I_GROUP           
#> 1 RNASeq2GeneNorm H5I_GROUP           
#> 2 RPPAArray       H5I_GROUP           
#> 3 gistict         H5I_GROUP           
#> 4 miRNASeqGene    H5I_GROUP
rhdf5::H5close()

Creating MAE objects from H5MU files

This package provides ReadH5MU to create an object with data from an H5MU file. Since H5MU structure has been designed to accommodate more structured information than MAE, only some data will be read. For instance, MAE has no support for loading multimodal embeddings or pairwise graphs.

acc <- readH5MU("miniacc.h5mu")
#> Reading as SingleCellExperiment where the original object class is matrix
#> Warning: sampleMap[['assay']] coerced with as.factor()
acc
#> A MultiAssayExperiment object of 5 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 5:
#>  [1] RNASeq2GeneNorm: SummarizedExperiment with 198 rows and 79 columns
#>  [2] gistict: SummarizedExperiment with 198 rows and 90 columns
#>  [3] RPPAArray: SummarizedExperiment with 33 rows and 46 columns
#>  [4] Mutations: SingleCellExperiment with 97 rows and 90 columns
#>  [5] miRNASeqGene: SummarizedExperiment with 471 rows and 80 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

Importantly, we recover the information from the original MAE object:

head(colData(miniACC)[,1:4])
#> DataFrame with 6 rows and 4 columns
#>                 patientID years_to_birth vital_status days_to_death
#>               <character>      <integer>    <integer>     <integer>
#> TCGA-OR-A5J1 TCGA-OR-A5J1             58            1          1355
#> TCGA-OR-A5J2 TCGA-OR-A5J2             44            1          1677
#> TCGA-OR-A5J3 TCGA-OR-A5J3             23            0            NA
#> TCGA-OR-A5J4 TCGA-OR-A5J4             23            1           423
#> TCGA-OR-A5J5 TCGA-OR-A5J5             30            1           365
#> TCGA-OR-A5J6 TCGA-OR-A5J6             29            0            NA
head(colData(acc)[,1:4])
#> DataFrame with 6 rows and 4 columns
#>                 patientID years_to_birth vital_status days_to_death
#>               <character>      <integer>    <integer>     <integer>
#> TCGA-OR-A5J1 TCGA-OR-A5J1             58            1          1355
#> TCGA-OR-A5J2 TCGA-OR-A5J2             44            1          1677
#> TCGA-OR-A5J3 TCGA-OR-A5J3             23            0            NA
#> TCGA-OR-A5J4 TCGA-OR-A5J4             23            1           423
#> TCGA-OR-A5J5 TCGA-OR-A5J5             30            1           365
#> TCGA-OR-A5J6 TCGA-OR-A5J6             29            0            NA

Features metadata is also recovered:

head(rowData(miniACC[["gistict"]]))
#> DataFrame with 6 rows and 3 columns
#>        Gene.Symbol    Locus.ID    Cytoband
#>        <character> <character> <character>
#> DIRAS3      DIRAS3        9077      1p31.3
#> MAPK14      MAPK14        1432     6p21.31
#> YAP1          YAP1       10413     11q22.1
#> CDKN1B      CDKN1B        1027     12p13.1
#> ERBB2        ERBB2        2064       17q12
#> G6PD          G6PD        2539        Xq28
head(rowData(acc[["gistict"]]))
#> DataFrame with 6 rows and 3 columns
#>        Gene.Symbol    Locus.ID    Cytoband
#>        <character> <character> <character>
#> DIRAS3      DIRAS3        9077      1p31.3
#> MAPK14      MAPK14        1432     6p21.31
#> YAP1          YAP1       10413     11q22.1
#> CDKN1B      CDKN1B        1027     12p13.1
#> ERBB2        ERBB2        2064       17q12
#> G6PD          G6PD        2539        Xq28

Backed objects

It is possible to read H5MU files while keeping matrices (both .X and .layers) on disk.

acc_b <- readH5MU("miniacc.h5mu", backed = TRUE)
#> Reading as SingleCellExperiment where the original object class is matrix
#> Warning: sampleMap[['assay']] coerced with as.factor()
assay(acc_b, "RNASeq2GeneNorm")[1:5,1:3]
#> <5 x 3> DelayedMatrix object of type "double":
#>        TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
#> DIRAS3                    1487.0317                       9.6631
#> MAPK14                     778.5783                    2823.6469
#> YAP1                      1009.6061                    2305.0590
#> CDKN1B                    2101.3449                    4248.9584
#> ERBB2                      651.2968                     246.4098
#>        TCGA-OR-A5J3-01A-11R-A29S-07
#> DIRAS3                      18.9602
#> MAPK14                    1061.7686
#> YAP1                      1561.2502
#> CDKN1B                    1348.5410
#> ERBB2                       90.0607

The data in the assay is a DelayedMatrix object:

class(assay(acc_b, "RNASeq2GeneNorm"))
#> [1] "DelayedMatrix"
#> attr(,"package")
#> [1] "DelayedArray"

This is in contrast to the acc object that has matrices in memory:

assay(acc, "RNASeq2GeneNorm")[1:5,1:3]
#>        TCGA-OR-A5J1-01A-11R-A29S-07 TCGA-OR-A5J2-01A-11R-A29S-07
#> DIRAS3                    1487.0317                       9.6631
#> MAPK14                     778.5783                    2823.6469
#> YAP1                      1009.6061                    2305.0590
#> CDKN1B                    2101.3449                    4248.9584
#> ERBB2                      651.2968                     246.4098
#>        TCGA-OR-A5J3-01A-11R-A29S-07
#> DIRAS3                      18.9602
#> MAPK14                    1061.7686
#> YAP1                      1561.2502
#> CDKN1B                    1348.5410
#> ERBB2                       90.0607
class(assay(acc, "RNASeq2GeneNorm"))
#> [1] "matrix" "array"

References

Session Info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellMultiModal_1.18.0 scater_1.35.0              
#>  [3] ggplot2_3.5.1               scuttle_1.17.0             
#>  [5] CiteFuse_1.19.0             MultiAssayExperiment_1.33.1
#>  [7] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
#>  [9] Biobase_2.67.0              GenomicRanges_1.59.1       
#> [11] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#> [13] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [15] MuData_1.11.0               rhdf5_2.51.0               
#> [17] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [19] generics_0.1.3              Matrix_1.7-1               
#> [21] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.2            filelock_1.0.3           tibble_3.2.1            
#>   [4] polyclip_1.10-7          lifecycle_1.0.4          edgeR_4.5.1             
#>   [7] lattice_0.22-6           MASS_7.3-61              magrittr_2.0.3          
#>  [10] limma_3.63.2             plotly_4.10.4            sass_0.4.9              
#>  [13] rmarkdown_2.29           jquerylib_0.1.4          yaml_2.3.10             
#>  [16] metapod_1.15.0           cowplot_1.1.3            bayesm_3.1-6            
#>  [19] DBI_1.2.3                buildtools_1.0.0         RColorBrewer_1.1-3      
#>  [22] abind_1.4-8              zlibbioc_1.52.0          Rtsne_0.17              
#>  [25] purrr_1.0.2              mixtools_2.0.0           ggraph_2.2.1            
#>  [28] tensorA_0.36.2.1         tweenr_2.0.3             rappdirs_0.3.3          
#>  [31] GenomeInfoDbData_1.2.13  ggrepel_0.9.6            irlba_2.3.5.1           
#>  [34] maketools_1.3.1          pheatmap_1.0.12          dqrng_0.4.1             
#>  [37] codetools_0.2-20         DelayedArray_0.33.3      ggforce_0.4.2           
#>  [40] tidyselect_1.2.1         UCSC.utils_1.3.0         farver_2.1.2            
#>  [43] ScaledMatrix_1.15.0      viridis_0.6.5            BiocFileCache_2.15.0    
#>  [46] jsonlite_1.8.9           BiocNeighbors_2.1.1      tidygraph_1.3.1         
#>  [49] ggridges_0.5.6           survival_3.7-0           dbscan_1.2-0            
#>  [52] segmented_2.1-3          tools_4.4.2              Rcpp_1.0.13-1           
#>  [55] glue_1.8.0               BiocBaseUtils_1.9.0      gridExtra_2.3           
#>  [58] SparseArray_1.7.2        xfun_0.49                HDF5Array_1.35.2        
#>  [61] dplyr_1.1.4              withr_3.0.2              BiocManager_1.30.25     
#>  [64] fastmap_1.2.0            rhdf5filters_1.19.0      bluster_1.17.0          
#>  [67] fansi_1.0.6              digest_0.6.37            rsvd_1.0.5              
#>  [70] mime_0.12                R6_2.5.1                 colorspace_2.1-1        
#>  [73] RSQLite_2.3.9            utf8_1.2.4               tidyr_1.3.1             
#>  [76] data.table_1.16.2        robustbase_0.99-4-1      graphlayouts_1.2.1      
#>  [79] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.7.1          
#>  [82] uwot_0.2.2               pkgconfig_2.0.3          gtable_0.3.6            
#>  [85] blob_1.2.4               XVector_0.47.0           sys_3.4.3               
#>  [88] htmltools_0.5.8.1        scales_1.3.0             png_0.1-8               
#>  [91] SpatialExperiment_1.17.0 scran_1.35.0             knitr_1.49              
#>  [94] rjson_0.2.23             reshape2_1.4.4           nlme_3.1-166            
#>  [97] curl_6.0.1               cachem_1.1.0             stringr_1.5.1           
#> [100] BiocVersion_3.21.1       parallel_4.4.2           vipor_0.4.7             
#> [103] AnnotationDbi_1.69.0     pillar_1.9.0             grid_4.4.2              
#> [106] vctrs_0.6.5              randomForest_4.7-1.2     BiocSingular_1.23.0     
#> [109] dbplyr_2.5.0             beachmat_2.23.3          cluster_2.1.6           
#> [112] beeswarm_0.4.0           evaluate_1.0.1           magick_2.8.5            
#> [115] cli_3.6.3                locfit_1.5-9.10          compiler_4.4.2          
#> [118] rlang_1.1.4              crayon_1.5.3             labeling_0.4.3          
#> [121] plyr_1.8.9               ggbeeswarm_0.7.2         stringi_1.8.4           
#> [124] viridisLite_0.4.2        BiocParallel_1.41.0      munsell_0.5.1           
#> [127] Biostrings_2.75.1        lazyeval_0.2.2           compositions_2.0-8      
#> [130] ExperimentHub_2.15.0     bit64_4.5.2              Rhdf5lib_1.29.0         
#> [133] KEGGREST_1.47.0          statmod_1.5.0            AnnotationHub_3.15.0    
#> [136] kernlab_0.9-33           igraph_2.1.1             memoise_2.0.1           
#> [139] bslib_0.8.0              DEoptimR_1.1-3-1         bit_4.5.0.1