Welcome to the MOSim
package, a versatile tool for
simulating bulk and single-cell multi-omics data. In this vignette, we
will explore how to create synthetic single-cell data, focusing on
single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq (scATAC-seq)
data. Using the MOSim
package, you can generate custom
multi-omics datasets for various experimental conditions, making it an
essential resource for testing and validating analysis methods, or
creating benchmark datasets.
Before we dive into the exciting world of data simulation, you’ll
need to install the MOSim
package. You can easily obtain it
from bioconductor using the following commands:
The core of data simulation lies in the sc_mosim function, which allows you to create synthetic single-cell multi-omics data. Let’s explore a typical example of its usage, using the default dataset loaded in the package:
library(MOSim)
# Create a list of omics data types (e.g., scRNA-seq and scATAC-seq)
omicsList <- sc_omicData(list("scRNA-seq", "scATAC-seq"),
data = NULL)
# Define cell types for your experiment
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
'Memory_B' = c(31:40))
# Load an association list containing peak IDs related to gene names
associationList <- data(associationList)
# Simulate multi-omics data with specific parameters
testing_groups <- sc_mosim(
omicsList,
cell_types,
numberReps = 2,
numberGroups = 3,
diffGenes = list(c(0.2, 0.3), c(0.2, 0.3)),
minFC = 0.25,
maxFC = 4,
numberCells = NULL,
mean = NULL,
sd = NULL,
regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2), c(0.1, 0.2)),
associationList = associationList,
TF = FALSE
)
In the example above, we load omics data types, specify experimental conditions and cell types, and load an association list. The sc_mosim function lets us simulate multi-omics data with various parameters, such as the number of replicates, differentially expressed genes, and regulatory effects.
Before diving into simulation, it’s essential to have your data
ready. The sc_omicData
function aids in preparing your data
for simulation. It accepts the following inputs:
omics_types: A list of omics data types, which can be “scRNA-seq” or “scATAC-seq.”
data (optional): A user-inputted list of matrices with features as rows and cells as columns. If data is NULL, the default data from 10 cells for 4 celltypes is loaded.
sc_mosim
also allows you to simulate data resembling
characteristics of a dataset of your choice. To do so, you need to
format your data using the sc_omicData
function. Supported
input formats include:
# This is done to get a dataset to extract a matrix (for example purposes)
scRNA <- MOSim::sc_omicData("scRNA-seq", data = NULL)
count <- scRNA[["scRNA-seq"]]
options(Seurat.object.assay.version = "v3")
Seurat_obj <- Seurat::CreateAssayObject(counts = count, assay = 'RNA')
# To format the data into sc_mosim-ready format, we pass the seurat
# object containing the count data we extracted into sc_omicData
omic_list_user <- sc_omicData(c("scRNA-seq"), data = c(Seurat_obj))
The resulting omic_list_user is a named list with “scRNA-seq” as the name and your count matrix as the value.
sc_mosim
can simulate scRNA, scATAC and TF count
matrices without providing any additional arguments. For a basic
simulation, you only need to input the omics list and cell types. Here’s
how it’s done:
omic_list <- sc_omicData(list("scRNA-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
'Memory_B' = c(31:40))
sim <- sc_mosim(omic_list, cell_types, TF = TRUE)
This will result in simulated raw count matrices for scRNA and TF.
The sc_mosim
function offers a range of parameters to
fine-tune your simulation:
omic_list <- sc_omicData(c("scRNA-seq", "scATAC-seq"))
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30),
'Memory_B' = c(31:40))
sim <- sc_mosim(omic_list, cell_types, numberReps = 2,
numberGroups = 2, diffGenes = list(c(0.2, 0.3)), feature_no = 8000,
clusters = 3, mean = c(2*10^6, 1*10^6,2*10^6, 1*10^6),
sd = c(5*10^5, 2*10^5, 5*10^5, 2*10^5),
regulatorEffect = list(c(0.1, 0.2), c(0.1, 0.2)), TF = FALSE)
The result of your simulation is stored in a named list with ‘sim_sc + omic name’ as names and Seurat objects as values. Each Seurat object contains the synthetic count matrices for your experiment. Other relevant information included in the object are:
cellTypes: A list specifying the columns in each simulated matrix that correspond to each cell type.
patterns: Matrix of co-expression patterns affecting the genes throughout cell-types.
FC: list of Fold Changes applied to each gene to simulate differential expression between experimental groups.
AssociationMatrices: gene/peak association matrices including differential expression and regulatory relationships.
Variability: added variability matrices to add dispersion to experimental groups and biological replicates.
To access simulation settings and other constraints for simulation,
you can use the sc_omicSettings
function. This provides
information about the relationship between genes and peaks,
differentially expressed genes, regulator types, expression patterns,
and fold changes for each gene and peak compared to group 1.
If you run the sc_mosim simulation including TF, you can also extract
the
association matrix of TF - target genes regulatory relationships.