Title: | Multi-Omics Simulation (MOSim) |
---|---|
Description: | MOSim package simulates multi-omic experiments that mimic regulatory mechanisms within the cell, allowing flexible experimental design including time course and multiple groups. |
Authors: | Carolina Monzó [aut], Carlos Martínez [aut], Sonia Tarazona [cre, aut] |
Maintainer: | Sonia Tarazona <[email protected]> |
License: | GPL-3 |
Version: | 2.3.0 |
Built: | 2024-11-29 08:18:36 UTC |
Source: | https://github.com/bioc/MOSim |
Data to showcase scRNA and scATAC-seq association
data("associationList")
data("associationList")
A dataframe with two columns and rows according to gene/feature relationships
ATAC chromosomic positions associated to genes
RNA genes associated to peaks
@source Created in-house to serve as an example
Helper function to calculate mean expression per celltype
calculate_mean_per_list_df(df, named_lists)
calculate_mean_per_list_df(df, named_lists)
df |
dataframe of expression where columns are cells |
named_lists |
list of which cells belong to each celltype |
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14)) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4)) calculate_mean_per_list_df(rna, cell_types)
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14)) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4)) calculate_mean_per_list_df(rna, cell_types)
Function to check if the TRUE FALSE patterns have at least two rows that are opposite, we need this to be able to generate repressor regulators
check_patterns(patterns)
check_patterns(patterns)
patterns |
tibble of TRUE FALSE values |
list of indices where the rows are opposite
patterns <- tibble::tibble(one = c(TRUE, FALSE, TRUE, FALSE), two = c(TRUE, TRUE, TRUE, TRUE), three = c(FALSE, TRUE, FALSE, TRUE), four = c(FALSE, TRUE, TRUE, TRUE)) opposite_indices <- check_patterns(patterns)
patterns <- tibble::tibble(one = c(TRUE, FALSE, TRUE, FALSE), two = c(TRUE, TRUE, TRUE, TRUE), three = c(FALSE, TRUE, FALSE, TRUE), four = c(FALSE, TRUE, TRUE, TRUE)) opposite_indices <- check_patterns(patterns)
Discretize ChIP-Seq counts to simulate a binary dataset
discretize(df, omic)
discretize(df, omic)
df |
A MOSimulated object |
omic |
Character string of the omic to transform into binary data |
A regulator dataframe of 0 and 1
omic_list <- c("RNA-seq", "ChIP-seq") rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = c(omicSim("ChIP-seq", totalFeatures = 2500))) rnaseq_simulated <- omicResults(rnaseq_simulation, omic_list) discrete_ChIP <- discretize(rnaseq_simulated, "ChIP-seq")
omic_list <- c("RNA-seq", "ChIP-seq") rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = c(omicSim("ChIP-seq", totalFeatures = 2500))) rnaseq_simulated <- omicResults(rnaseq_simulation, omic_list) discrete_ChIP <- discretize(rnaseq_simulated, "ChIP-seq")
Retrieves the experimental design
experimentalDesign(simulation)
experimentalDesign(simulation)
simulation |
A MOSimulation object |
A data frame containing the experimental design used to simulate the data.
omic_list <- c("RNA-seq") rnaseq_simulation <- mosim(omics = omic_list) # This will be a data frame with RNA-seq counts design_matrix <- experimentalDesign(rnaseq_simulation)
omic_list <- c("RNA-seq") rnaseq_simulation <- mosim(omics = omic_list) # This will be a data frame with RNA-seq counts design_matrix <- experimentalDesign(rnaseq_simulation)
Check if a variable is declared.
is.declared(object, key = NULL)
is.declared(object, key = NULL)
object |
Variable name to check |
key |
Optional key to check inside object. |
TRUE or FALSE indicating if the variable is initialized & non-empty.
This function generates a dataframe containing the information of the relationship between ATAC and RNA, based on the cluster groups, and then tells the order the genes and peaks should be in the simulated dataframe of the group
make_association_dataframe( group, genereggroup, numtotalgenes, numtotalpeaks, minFC, maxFC )
make_association_dataframe( group, genereggroup, numtotalgenes, numtotalpeaks, minFC, maxFC )
group |
Group from which we are generating the association dataframe |
genereggroup |
list of elements to generate the association dataframe such as clusters of each omic, indices of opposite clusters, which genes are activated, repressed, behavior of the features etc. |
numtotalgenes |
total number of genes |
numtotalpeaks |
total number of peaks |
minFC |
FC below which is downregulated |
maxFC |
FC above which is upregulated |
a dataframe with all the information the user needs about each gene and the order of gene and peak names to rename them in the simulated datasets of the group
Function to make the tibble with cluster combinations for the gene expression patterns along the cells This function is a slightly modified copy of the 'make_cluster_patterns' function from the 'Acorde' package (v1.0.0), originally developed by Arzalluz-Luque A, Salguero P, Tarazona S, Conesa A. (2022). acorde unravels functionally interpretable networks of isoform co-usage from single cell data. Nature communications 1828. DOI: 10.1038/s41467-022-29497-w. The original package is licensed under the GPL-3 license.
make_cluster_patterns(numcells = 4, clusters = 8)
make_cluster_patterns(numcells = 4, clusters = 8)
numcells |
Number of different celltypes we are simulating |
clusters |
OPTIONAL. Number of co-expression patterns the user wants to simulate |
A tibble with number of columns equal to number of celltypes, rows according to the number of TRUE/FALSE combinations corresponding to the gene expression patterns along the cells
patterns <- make_cluster_patterns(numcells = 4, clusters = 8) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) patterns <- make_cluster_patterns(numcells = length(cell_types), clusters = 8)
patterns <- make_cluster_patterns(numcells = 4, clusters = 8) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) patterns <- make_cluster_patterns(numcells = length(cell_types), clusters = 8)
Helper function to make the most similar profiles possible between gene and regulator
match_gene_regulator(rna, atac, cell_types, associationList)
match_gene_regulator(rna, atac, cell_types, associationList)
rna |
dataframe of RNA expression |
atac |
dataframe of ATAC expression |
cell_types |
list of which cells belong to each celltype |
associationList |
dataframe of two columns, Gene_ID and Peak_ID |
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14), c5 = c(7, 0, 0, 0), c6 = c(8, 1, 1, 1), c7 = c(8, 1, 1, 1)) rownames(rna) <- c('GenB', 'GenA', 'GenC', 'GenD') associationList <- data.frame(Gene_ID = c('GenA', 'GenB', 'GenC', 'GenA'), Peak_ID = c('PeakA', 'PeakB', 'PeakC', 'PeakD')) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4), "ct3" = c(5, 6), "ct4" = c(7)) atac <- data.frame(c1 = c(3,20, 1,15, 1, 7, 1), c2 = c(4,20, 2,15, 0, 5, 1.5), c3 = c(10, 13, 1, 12, 1, 14, 9), c4 = c(11, 14, 1, 13, 1, 4, 12), c5 = c(0, 0, 0, 7, 1, 6, 6), c6 = c(1, 1, 1, 8, 0, 5, 8), c7 = c(1, 1, 1, 8, 1, 5, 7)) rownames(atac) <- c('PeakB', "PeakC", "PeakF", "PeakD", "PeakE", "PeakA", "PeakG") match_gene_regulator(rna, atac, cell_types, associationList)
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14), c5 = c(7, 0, 0, 0), c6 = c(8, 1, 1, 1), c7 = c(8, 1, 1, 1)) rownames(rna) <- c('GenB', 'GenA', 'GenC', 'GenD') associationList <- data.frame(Gene_ID = c('GenA', 'GenB', 'GenC', 'GenA'), Peak_ID = c('PeakA', 'PeakB', 'PeakC', 'PeakD')) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4), "ct3" = c(5, 6), "ct4" = c(7)) atac <- data.frame(c1 = c(3,20, 1,15, 1, 7, 1), c2 = c(4,20, 2,15, 0, 5, 1.5), c3 = c(10, 13, 1, 12, 1, 14, 9), c4 = c(11, 14, 1, 13, 1, 4, 12), c5 = c(0, 0, 0, 7, 1, 6, 6), c6 = c(1, 1, 1, 8, 0, 5, 8), c7 = c(1, 1, 1, 8, 1, 5, 7)) rownames(atac) <- c('PeakB', "PeakC", "PeakF", "PeakD", "PeakE", "PeakA", "PeakG") match_gene_regulator(rna, atac, cell_types, associationList)
match_gene_regulator_cluster
match_gene_regulator_cluster(rna, atac, cell_types, associationMatrix)
match_gene_regulator_cluster(rna, atac, cell_types, associationMatrix)
rna |
rna expression dataframe |
atac |
atac expression dataframe |
cell_types |
list of which cells belong to each celltype |
associationMatrix |
matrix of related genes and peaks |
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14), c5 = c(7, 0, 0, 0), c6 = c(8, 1, 1, 1), c7 = c(8, 1, 1, 1)) rownames(rna) <- c('GenB', 'GenA', 'GenC', 'GenD') associationList <- data.frame(Gene_ID = c('GenA', 'GenB', 'GenC', 'GenA'), Peak_ID = c('PeakA', 'PeakB', 'PeakC', 'PeakD'), Gene_cluster = c(1, 2, 1, 2), Peak_cluster = c(1, 2, 1, 2)) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4), "ct3" = c(5, 6), "ct4" = c(7)) atac <- data.frame(c1 = c(3,20, 1,15, 1, 7, 1), c2 = c(4,20, 2,15, 0, 5, 1.5), c3 = c(10, 13, 1, 12, 1, 14, 9), c4 = c(11, 14, 1, 13, 1, 4, 12), c5 = c(0, 0, 0, 7, 1, 6, 6), c6 = c(1, 1, 1, 8, 0, 5, 8), c7 = c(1, 1, 1, 8, 1, 5, 7)) rownames(atac) <- c('PeakB', "PeakC", "PeakF", "PeakD", "PeakE", "PeakA", "PeakG") match_gene_regulator_cluster(rna, atac, cell_types, associationList)
rna <- data.frame(c1 = c(1.5, 15.5, 3.5, 20.5), c2 = c(2, 15, 4, 20), c3 = c(10, 1, 12, 13), c4 = c(11, 1, 13, 14), c5 = c(7, 0, 0, 0), c6 = c(8, 1, 1, 1), c7 = c(8, 1, 1, 1)) rownames(rna) <- c('GenB', 'GenA', 'GenC', 'GenD') associationList <- data.frame(Gene_ID = c('GenA', 'GenB', 'GenC', 'GenA'), Peak_ID = c('PeakA', 'PeakB', 'PeakC', 'PeakD'), Gene_cluster = c(1, 2, 1, 2), Peak_cluster = c(1, 2, 1, 2)) cell_types <- list("ct1" = c(1,2), "ct2" = c(3, 4), "ct3" = c(5, 6), "ct4" = c(7)) atac <- data.frame(c1 = c(3,20, 1,15, 1, 7, 1), c2 = c(4,20, 2,15, 0, 5, 1.5), c3 = c(10, 13, 1, 12, 1, 14, 9), c4 = c(11, 14, 1, 13, 1, 4, 12), c5 = c(0, 0, 0, 7, 1, 6, 6), c6 = c(1, 1, 1, 8, 0, 5, 8), c7 = c(1, 1, 1, 8, 1, 5, 7)) rownames(atac) <- c('PeakB', "PeakC", "PeakF", "PeakD", "PeakE", "PeakA", "PeakG") match_gene_regulator_cluster(rna, atac, cell_types, associationList)
Performs a multiomic simulation by chaining two actions: 1) Creating the "MOSimulation" class with the provided params. 2) Calling "simulate" method on the initialized object.
mosim( omics, omicsOptions, diffGenes, numberReps, numberGroups, times, depth, profileProbs, minMaxFC, TFtoGene )
mosim( omics, omicsOptions, diffGenes, numberReps, numberGroups, times, depth, profileProbs, minMaxFC, TFtoGene )
omics |
Character vector containing the names of the omics to simulate, which can be "RNA-seq", "miRNA-seq", "DNase-seq", "ChIP-seq" or "Methyl-seq" (e.g. c("RNA-seq", "miRNA-seq")). It can also be a list with the omic names as names and their options as values, but we recommend to use the argument omicSim to provide the options to simulated each omic. |
omicsOptions |
List containing the options to simulate each omic. We recommend to apply the helper method omicSim to create this list in a friendly way, and the function omicData to provide custom data (see the related sections for more information). Each omic may have different configuration parameters, but the common ones are:
|
diffGenes |
Number of differentially expressed genes to simulate, given in percentage (0 - 1) or in absolute number (> 1). By default 0.15 |
numberReps |
Number of replicates per experimetal condition (and time point, if time series are to be generated). By default 3. |
numberGroups |
Number of experimental groups or conditions to simulate. |
times |
Vector of time points to consider in the experimental design. |
depth |
Sequencing depth in millions of reads. |
profileProbs |
Numeric vector with the probabilities to assign each of the patterns. Defaults to 0.2 for each. |
minMaxFC |
Numeric vector of length 2 with minimum and maximum fold-change for differentially expressed features, respectively. |
TFtoGene |
A logical value indicating if default transcription factors data should be used (TRUE) or not (FALSE), or a 3 column data frame containing custom associations. By default FALSE. |
Instance of class "MOSimulation" containing the multiomic simulation data.
moSimulation <- mosim( omics = c("RNA-seq"), numberReps = 3, times = c(0, 2, 6, 12, 24) ) # Retrieve simulated count matrix for RNA-seq dataRNAseq <- omicResults(moSimulation, "RNA-seq")
moSimulation <- mosim( omics = c("RNA-seq"), numberReps = 3, times = c(0, 2, 6, 12, 24) ) # Retrieve simulated count matrix for RNA-seq dataRNAseq <- omicResults(moSimulation, "RNA-seq")
Set customized data for an omic.
omicData(omic, data = NULL, associationList = NULL)
omicData(omic, data = NULL, associationList = NULL)
omic |
The name of the omic to provide data. |
data |
Data frame with the omic identifiers as row names and just one column named Counts containing numeric values used as initial sample for the simulation. |
associationList |
Only for regulatory omics, a data frame with 2 columns, the first called containing the regulator ID and the second called Gene with the gene identifier. |
Initialized simulation object with the given data.
# Take a subset of the included dataset for illustration # purposes. We could also load it from a csv file or RData, # as long as we transform it to have 1 column named "Counts" # and the identifiers as row names. data(sampleData) custom_rnaseq <- head(sampleData$SimRNAseq$data, 100) # In this case, 'custom_rnaseq' is a data frame with # the structure: head(custom_rnaseq) ## Counts ## ENSMUSG00000000001 6572 ## ENSMUSG00000000003 0 ## ENSMUSG00000000028 4644 ## ENSMUSG00000000031 8 ## ENSMUSG00000000037 0 ## ENSMUSG00000000049 0 # The helper 'omicData' returns an object with our custom data. rnaseq_customdata <- omicData("RNA-seq", data = custom_rnaseq)
# Take a subset of the included dataset for illustration # purposes. We could also load it from a csv file or RData, # as long as we transform it to have 1 column named "Counts" # and the identifiers as row names. data(sampleData) custom_rnaseq <- head(sampleData$SimRNAseq$data, 100) # In this case, 'custom_rnaseq' is a data frame with # the structure: head(custom_rnaseq) ## Counts ## ENSMUSG00000000001 6572 ## ENSMUSG00000000003 0 ## ENSMUSG00000000028 4644 ## ENSMUSG00000000031 8 ## ENSMUSG00000000037 0 ## ENSMUSG00000000049 0 # The helper 'omicData' returns an object with our custom data. rnaseq_customdata <- omicData("RNA-seq", data = custom_rnaseq)
Retrieves the simulated data.
omicResults(simulation, omics = NULL, format = "data.frame")
omicResults(simulation, omics = NULL, format = "data.frame")
simulation |
A MOSimulation object. |
omics |
List of the omics to retrieve the simulated data. |
format |
Type of object to use for returning the results |
A list containing an element for every omic specifiec, with the simulation data in the format indicated, or a numeric matrix with simulated data if the omic name is directly provided.
omic_list <- c("RNA-seq") rnaseq_simulation <- mosim(omics = omic_list) #' # This will be a data frame with RNA-seq counts rnaseq_simulated <- omicResults(rnaseq_simulation, "RNA-seq") # Group1.Time0.Rep1 Group1.Time0.Rep2 Group1.Time0.Rep3 ... # ENSMUSG00000073155 4539 5374 5808 ... # ENSMUSG00000026251 0 0 0 ... # ENSMUSG00000040472 2742 2714 2912 ... # ENSMUSG00000021598 5256 4640 5130 ... # ENSMUSG00000032348 421 348 492 ... # ENSMUSG00000097226 16 14 9 ... # ENSMUSG00000027857 0 0 0 ... # ENSMUSG00000032081 1 0 0 ... # ENSMUSG00000097164 794 822 965 ... # ENSMUSG00000097871 0 0 0 ...
omic_list <- c("RNA-seq") rnaseq_simulation <- mosim(omics = omic_list) #' # This will be a data frame with RNA-seq counts rnaseq_simulated <- omicResults(rnaseq_simulation, "RNA-seq") # Group1.Time0.Rep1 Group1.Time0.Rep2 Group1.Time0.Rep3 ... # ENSMUSG00000073155 4539 5374 5808 ... # ENSMUSG00000026251 0 0 0 ... # ENSMUSG00000040472 2742 2714 2912 ... # ENSMUSG00000021598 5256 4640 5130 ... # ENSMUSG00000032348 421 348 492 ... # ENSMUSG00000097226 16 14 9 ... # ENSMUSG00000027857 0 0 0 ... # ENSMUSG00000032081 1 0 0 ... # ENSMUSG00000097164 794 822 965 ... # ENSMUSG00000097871 0 0 0 ...
Retrieves the settings used in a simulation
omicSettings( simulation, omics = NULL, association = FALSE, reverse = FALSE, only.linked = FALSE, prefix = FALSE, include.lagged = TRUE )
omicSettings( simulation, omics = NULL, association = FALSE, reverse = FALSE, only.linked = FALSE, prefix = FALSE, include.lagged = TRUE )
simulation |
A MOSimulation object. |
omics |
List of omics to retrieve the settings. |
association |
A boolean indicating if the association must also be returned for the regulators. |
reverse |
A boolean, swap the column order in the association list in case we want to use the output directly and the program requires a different ordering. |
only.linked |
Return only the interactions that have an effect. |
prefix |
Logical indicating if the name of the omic should prefix the name of the regulator. |
include.lagged |
Logical indicating if interactions with transitory profile and different minimum/maximum time point between gene and regulator should be included or not. |
A list containing a data frame with the settings used to simulate each of the indicated omics. If association is TRUE, it will be a list with 3 keys: 'associations', 'settings' and 'regulators', with the first two keys being a list containing the information for the selected omics and the last one a global data frame giving the merged information.
omic_list <- c("RNA-seq", "miRNA-seq") multi_simulation <- mosim(omics = omic_list) # This will be a data frame with RNA-seq settings (DE flag, profiles) rnaseq_settings <- omicSettings(multi_simulation, "RNA-seq") # This will be a list containing all the simulated omics (RNA-seq # and DNase-seq in this case) all_settings <- omicSettings(multi_simulation)
omic_list <- c("RNA-seq", "miRNA-seq") multi_simulation <- mosim(omics = omic_list) # This will be a data frame with RNA-seq settings (DE flag, profiles) rnaseq_settings <- omicSettings(multi_simulation, "RNA-seq") # This will be a list containing all the simulated omics (RNA-seq # and DNase-seq in this case) all_settings <- omicSettings(multi_simulation)
Set the simulation settings for an omic.
omicSim(omic, depth = NULL, totalFeatures = NULL, regulatorEffect = NULL)
omicSim(omic, depth = NULL, totalFeatures = NULL, regulatorEffect = NULL)
omic |
Name of the omic to set the settings. |
depth |
Sequencing depth in millions of counts. If not provided will take the global parameter passed to mosim function. |
totalFeatures |
Limit the number of features to simulate. By default include all present in the dataset. |
regulatorEffect |
only for regulatory omics. Associative list containing the percentage of effects over the total number of regulator, including repressor, association and no effect (NE). |
A list with the appropiate structure to be given as options in mosim function.
omic_list <- c("RNA-seq", "miRNA-seq") rnaseq_options <- c(omicSim("miRNA-seq", totalFeatures = 2500)) # The return value is an associative list compatible with # 'omicsOptions' rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = rnaseq_options)
omic_list <- c("RNA-seq", "miRNA-seq") rnaseq_options <- c(omicSim("miRNA-seq", totalFeatures = 2500)) # The return value is an associative list compatible with # 'omicsOptions' rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = rnaseq_options)
Generate a plot of a feature's profile for one or two omics.
plotProfile(simulation, omics, featureIDS, drawReps = FALSE, groups = NULL)
plotProfile(simulation, omics, featureIDS, drawReps = FALSE, groups = NULL)
simulation |
A MOSimulation object |
omics |
Character vector of the omics to simulate. |
featureIDS |
List containing the feature to show per omic. Must have the omics as the list names and the features as values. |
drawReps |
Logical to enable/disable the representation of the replicates inside the plot. |
groups |
Character vector indicating the groups to plot in the form "GroupX" (i.e. Group1) |
A ggplot2 object.
omic_list <- c("RNA-seq", "miRNA-seq") rnaseq_options <- c(omicSim("miRNA-seq", totalFeatures = 2500)) rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = rnaseq_options) #plotProfile(rnaseq_simulation, # omics = c("RNA-seq", "miRNA-seq"), # featureIDS = list("RNA-seq"="ENSMUSG00000007682", "miRNA-seq"="mmu-miR-320-3p") #)
omic_list <- c("RNA-seq", "miRNA-seq") rnaseq_options <- c(omicSim("miRNA-seq", totalFeatures = 2500)) rnaseq_simulation <- mosim(omics = omic_list, omicsOptions = rnaseq_options) #plotProfile(rnaseq_simulation, # omics = c("RNA-seq", "miRNA-seq"), # featureIDS = list("RNA-seq"="ENSMUSG00000007682", "miRNA-seq"="mmu-miR-320-3p") #)
random_unif_interval Function to call the C code This function is a copy of the 'random_unif_interval' function from the 'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo, Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed under the GPL-3 license.
random_unif_interval(size, max_val)
random_unif_interval(size, max_val)
size |
from sparsim |
max_val |
from sparsim |
Dataset with base counts and id-gene tables.
data("sampleData")
data("sampleData")
An object of class list
of length 6.
List with 6 elements:
Dataframe with base counts with gene id as rownames.
Length of every gene.
Dataframe with base counts with regions as rownames.
Dataframe with region as "ID" column and gene name on "Gene" column.
Dataframe with base counts with regions as rownames.
Dataframe with region as "ID" column and gene name on "Gene" column.
Dataframe with base counts with miRNA id as rownames.
Dataframe with miRNA as "ID" column and gene name on "Gene" column.
Dataframe with region as "ID" column and gene name on "Gene" column.
Dataframe of CpG to be used as initialization data, located on "Region" column
Performs multiomic simulation of single cell datasets
sc_mosim( omics, cellTypes, numberReps = 1, numberGroups = 1, diffGenes = NULL, minFC = 0.25, maxFC = 4, numberCells = NULL, mean = NULL, sd = NULL, noiseRep = 0.1, noiseGroup = 0.5, regulatorEffect = NULL, associationList = NULL, feature_no = 8000, clusters = 3, cluster_size = NULL, TF = FALSE, TFdf = NULL )
sc_mosim( omics, cellTypes, numberReps = 1, numberGroups = 1, diffGenes = NULL, minFC = 0.25, maxFC = 4, numberCells = NULL, mean = NULL, sd = NULL, noiseRep = 0.1, noiseGroup = 0.5, regulatorEffect = NULL, associationList = NULL, feature_no = 8000, clusters = 3, cluster_size = NULL, TF = FALSE, TFdf = NULL )
omics |
named list containing the omic to simulate as names, which can be "scRNA-seq" or "scATAC-seq". |
cellTypes |
list where the i-th element of the list contains the column indices for i-th experimental conditions. List must be a named list. |
numberReps |
OPTIONAL. Number of replicates per group |
numberGroups |
OPTIONAL. number of different groups |
diffGenes |
OPTIONAL. If number groups > 1, Percentage DE genes to simulate. List of vectors (one per group to compare to group 1) where the vector contains absolute number of genes for Up and Down ex: c(250, 500) or a percentage for up, down ex: c(0.2, 0.2). The rest will be NE |
minFC |
OPTIONAL. Threshold of FC below which are downregulated, by default 0.25 |
maxFC |
OPTIONAL. Threshold of FC above which are upregulated, by default 4 |
numberCells |
OPTIONAL. Vector of numbers. The numbers correspond to the number of
cells the user wants to simulate per each cell type. The length of the
vector must be the same as length of |
mean |
OPTIONAL. Vector of numbers of mean depth per each cell type. Must be specified
just if |
sd |
OPTIONAL. Vector of numbers of standard deviation per each cell type. Must be
specified just if |
noiseRep |
OPTIONAL. Number indicating the desired standard deviation between biological replicates. |
noiseGroup |
OPTIONAL. Number indicating the desired standard deviation between treatment groups |
regulatorEffect |
OPTIONAL. To simulate relationship scRNA-scATAC, list of vectors (one per group) where the vector contains absolute number of regulators for Activator and repressor ex: c(150, 200) or a percentage for Activator and repressor ex: c(0.2, 0.1). The rest will be NE. If not provided, no table of association between scRNA and scATAC is outputted. |
associationList |
REQUIRED A 2 columns dataframe reporting peak ids related to gene names. If user doesnt have one, load from package data("associationList") |
feature_no |
OPTIONAL. If only scRNA-seq to simulate or scRNA and scATAC but no regulatory constraints, total number of features to be distributed between the coexpression clusters. |
clusters |
OPTIONAL. Number of co-expression patterns the user wants to simulate |
cluster_size |
OPTIONAL. It may be inputted by the user. Recommended: by default, its the number of features divided by the number of patterns to generate. |
TF |
OPTIONAL default is FALSE, if true, extract TF dataframe |
TFdf |
OPTIONAL, default is NULL. If an association matrix of TF and Target_gene is given the TF expression values are extracted. If no data.frame is given, using the association of human TF from https://tflink.net/ |
a list of Seurat object, one per each omic.
omic_list <- sc_omicData(list("scRNA-seq")) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) sim <- sc_mosim(omic_list, cell_types)
omic_list <- sc_omicData(list("scRNA-seq")) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) sim <- sc_mosim(omic_list, cell_types)
Checks if the user defined data is in the correct format, or loads the default multiomics pbmc dataset, a subset from SeuratData package
sc_omicData(omics_types, data = NULL)
sc_omicData(omics_types, data = NULL)
omics_types |
A list of strings which can be either "scRNA-seq" or "scATAC-seq" |
data |
A user input matrix with genes (peaks in case of scATAC-seq) as rows and cells as columns. By default, it loads the example data. If a user input matrix is included, cell columns must be sorted by cell t ype. |
a named list with omics type as name and the count matrix as value
# Simulate from PBMC omicsList <- sc_omicData(list("scRNA-seq", "scATAC-seq"))
# Simulate from PBMC omicsList <- sc_omicData(list("scRNA-seq", "scATAC-seq"))
sc_omicResults
sc_omicResults(sim)
sc_omicResults(sim)
sim |
a simulated object from sc_mosim function |
list of seurat objects with simulated data
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) omicsList <- sc_omicData(list("scRNA-seq")) sim <- sc_mosim(omicsList, cell_types) res <- sc_omicResults(sim)
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) omicsList <- sc_omicData(list("scRNA-seq")) sim <- sc_mosim(omicsList, cell_types) res <- sc_omicResults(sim)
sc_omicSettings
sc_omicSettings(sim, TF = FALSE)
sc_omicSettings(sim, TF = FALSE)
sim |
a simulated object from sc_mosim function |
TF |
OPTIONAL default is FALSE, if true, extract TF association matrix |
list of Association matrices explaining the effects of each regulator to each gene
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) omicsList <- sc_omicData(list("scRNA-seq")) sim <- sc_mosim(omicsList, cell_types) res <- sc_omicSettings(sim)
cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) omicsList <- sc_omicData(list("scRNA-seq")) sim <- sc_mosim(omicsList, cell_types) res <- sc_omicSettings(sim)
Evaluate the users parameters for single cell simulation and use SPARSim to simulate the main dataset. Internal function
sc_param_estimation( omics, cellTypes, diffGenes = list(c(0.2, 0.2)), minFC = 0.25, maxFC = 4, numberCells = NULL, mean = NULL, sd = NULL, noiseGroup = 0.5, group = 1, genereggroup )
sc_param_estimation( omics, cellTypes, diffGenes = list(c(0.2, 0.2)), minFC = 0.25, maxFC = 4, numberCells = NULL, mean = NULL, sd = NULL, noiseGroup = 0.5, group = 1, genereggroup )
omics |
named list containing the omics to simulate as names, which can be "scRNA-seq" or "scATAC-seq". |
cellTypes |
list where the i-th element of the list contains the column indices for i-th cell type. List must be a named list. |
diffGenes |
If number groups > 1, Percentage DE genes to simulate. List of vectors (one per group to compare to group 1) where the vector contains absolute number of genes for Up and Down ex: c(250, 500) or a percentage for up, down ex: c(0.2, 0.2). The rest will be NE |
minFC |
Threshold of FC below which are downregulated, by default 0.25 |
maxFC |
Threshold of FC above which are upregulated, by default 4 |
numberCells |
vector of numbers. The numbers correspond to the number
of cells the user wants to simulate per each cell type. The length of the
vector must be the same as length of |
mean |
vector of numbers of mean depth per each cell type. Must be specified
just if |
sd |
vector of numbers of standard deviation per each cell type. Must be
specified just if |
noiseGroup |
OPTIONAL. Number indicating the desired standard deviation between treatment groups |
group |
Group for which to estimate parameters |
genereggroup |
List with information of genes, clusters and regulators that must be related to each other |
a list of Seurat object, one per each omic.
a named list with simulation parameters for each omics as values.
omicsList <- sc_omicData(list("scRNA-seq")) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) #estimated_params <- sc_param_estimation(omicsList, cell_types)
omicsList <- sc_omicData(list("scRNA-seq")) cell_types <- list('Treg' = c(1:10),'cDC' = c(11:20),'CD4_TEM' = c(21:30), 'Memory_B' = c(31:40)) #estimated_params <- sc_param_estimation(omicsList, cell_types)
Data to test scMOSim
data("scatac")
data("scatac")
A seurat Object, subset from seuratData with ATAC
ATAC expression values
annotations of celltypes
@source https://github.com/satijalab/seurat-data, we took 11 cells from each of 4 celltypes
Data to test scMOSim
data("scrna")
data("scrna")
A seurat Object, subset from seuratData with RNA
RNA expression values
annotations of celltypes
@source https://github.com/satijalab/seurat-data, we took 11 cells from each of 4 celltypes This is how: dat <- pbmcMultiome.SeuratData::pbmc.rna dat <- subset(x = dat, subset = seurat_annotations "cDC", "Memory B", "Treg")) unique_cell_types <- unique(datATmeta.data$seurat_annotations) extracted_cells <- list() cellnames <- c() for (cell_type in unique_cell_types) type_cells <- subset(dat, subset = seurat_annotations counts <- as.matrix(type_cellsATassays[["RNA"]]ATcounts) extracted_cells[[cell_type]] <- counts[, 1:10] cellnames <- append(cellnames, replicate(11, cell_type)) scrna <- Reduce(cbind, extracted_cells)
This function is used internally by acorde
to perform
the shuffling of simulated features for an individual cell type, as part of
the co-expression simulation process. The function is called recursively by
simulate_coexpression()
to
perform the simulation on a full scRNA-seq matrix.
shuffle_group_matrix(sim_data, feature_ids, group_pattern, ngroups)
shuffle_group_matrix(sim_data, feature_ids, group_pattern, ngroups)
sim_data |
A count matrix with features as rows and cells as columns.
Feature IDs must be included in an additional column named |
feature_ids |
A two-column |
group_pattern |
A logical vector, containing |
ngroups |
An integer indicating the number of groups that top and bottom features should be divided into. It is computed by dividing the number of features selected as highly/lowly expressed by the size of the clusters that are to be generated. |
An expression matrix, with the same characteristics as sim_data
,
and a number of features defined as the total amount of top/bottom features
selected divided by the number of clusters for which co-expression patterns
where supplied.
Adapted from ACORDE (https://github.com/ConesaLab/acorde) to adapt to our data input type. Simulates coexpression of genes along celltypes
simulate_coexpression( sim_matrix, feature_no, cellTypes, patterns, cluster_size = NULL )
simulate_coexpression( sim_matrix, feature_no, cellTypes, patterns, cluster_size = NULL )
sim_matrix |
Matrix with rows as features and columns as cells |
feature_no |
Total number of features to be distributed between the coexpression clusters |
cellTypes |
list where the i-th element of the list contains the column indices for i-th experimental conditions. List must be a named list. |
patterns |
Tibble with TRUE FALSE depicting the cluster patterns to
simulate. Generated by the user or by |
cluster_size |
OPTIONAL. It may be inputted by the user. By default, its the number of features divided by the number of patterns to generate. |
This function is a slightly modified copy of the 'simulate_coexpression' function from the 'Acorde' package (v1.0.0), originally developed by Arzalluz-Luque A, Salguero P, Tarazona S, Conesa A. (2022). acorde unravels functionally interpretable networks of isoform co-usage from single cell data. Nature communications 1828. DOI: 10.1038/s41467-022-29497-w. The original package is licensed under the GPL-3 license.
the simulated coexpression
Function to simulate the technical variability (i.e. a multivariate hypergeometric on a gamma expression value array)
simulate_hyper(avgAbund, seqdepth = NULL, digits, max_val)
simulate_hyper(avgAbund, seqdepth = NULL, digits, max_val)
avgAbund |
array containing the intensity values for each feature. It describes the intensity of a single sample |
seqdepth |
sequencing depth (i.e. sample size of the MH) |
digits |
number of digits for random number generation |
max_val |
max value for random number generation |
This function is a copy of the 'simulate_hyper' function from the 'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo, Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed under the GPL-3 license.
An array of length(avgAbund)
elements representing the count values for the current sample
Function to create a SPARSim simulation parameter.
This function is a copy of the 'SPARSIM_create_simulation_parameter' function from the
'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo,
Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed
under the GPL-3 license.
To simulate N feature (e.g. genes), user must specify N values of gene expression level and gene expression variability in the function input parameters intensity
and variability
, respectively.
To simulate M samples (i.e. cells), user must specify M values of sample library size in the function input parameter library_size
.
sparsim_create_simulation_parameter( intensity, variability, library_size, feature_names = NA, sample_names = NA, condition_name = NA, intensity_2 = NULL, variability_2 = NULL, p_bimod = NULL )
sparsim_create_simulation_parameter( intensity, variability, library_size, feature_names = NA, sample_names = NA, condition_name = NA, intensity_2 = NULL, variability_2 = NULL, p_bimod = NULL )
intensity |
Array of gene expression intensity values |
variability |
Array of gene expression variability values |
library_size |
Array of library size values |
feature_names |
Array of feature names. It must be of the same length of |
sample_names |
Array of sample names. It must be of the same length of |
condition_name |
Name associated to the current experimental condition. If NA (default), it will be set to "cond<l1><l2>", where l1 and l2 are two random letters. |
intensity_2 |
Array of gene expression intensity values for the second expression mode, if simulating genes with bimodal gene expression. Entries containing |
variability_2 |
Array of gene expression variability values for the second expression mode, if simulating genes with bimodal gene expression. If NULL (default), no bimodal gene expression is simulated. |
p_bimod |
Array of bimodal gene expression probabilities; the i-th value indicates the probability |
User can optionally specify the names to assign at the single feature and sample to simulate (function input parameters feature_names
and sample_names
, respectively,
as well as the name of the experimental condition (function input parameter condition_name
). If the user does not specify such information, the function will set some default values.
To simulate T different experimental conditions in a single count table, then T different simulation parameters must be created.
SPARSim simulation parameter describing one experimental condition
Function to estimate the intensity values from the genes in data
. The intensity is computed as mean of normalized counts for each gene.
sparsim_estimate_intensity(data)
sparsim_estimate_intensity(data)
data |
normalized count data matrix (gene on rows, samples on columns). |
This function is a copy of the 'SPARSIM_estimate_intensity' function from the
'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo,
Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed
under the GPL-3 license.
This function is used in sparsim_estimate_parameter_from_data
to compute SPARSim "intensity" parameter, given a real count table as input.
If the count table contains more than one experimental condition, then the function is applied to each experimental conditions.
An array of intensity values having N_genes
elements (N_genes = nrow(data)
). Array entries are named with gene names.
Function to estimate the library sizes from the samples in data
.
sparsim_estimate_library_size(data)
sparsim_estimate_library_size(data)
data |
raw count data matrix (gene on rows, samples on columns) |
This function is a copy of the 'SPARSIM_estimate_library_size' function from the
'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo,
Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed
under the GPL-3 license.
This function is used in sparsim_estimate_parameter_from_data
to compute SPARSim "library size" parameter, given a real count table as input.
If the count table contains more than one experimental condition, then the function is applied to each experimental conditions.
An array of library size values having N_samples
elements (N_samples = ncol(data)
)
Function to estimate SPARSim simulation parameters (intensity, variability and library sizes) from a real count table. If the real count table contains more than one experimental condition, it is possible to estimate the parameters for each experimental condition.
sparsim_estimate_parameter_from_data(raw_data, norm_data, conditions)
sparsim_estimate_parameter_from_data(raw_data, norm_data, conditions)
raw_data |
count matrix (gene on rows, samples on columns) containing raw count data |
norm_data |
count matrix (gene on rows, samples on columns) containing normalized count data |
conditions |
list where the i-th element of the list contains the column indices for i-th experimental conditions. List must be a named list. |
This function is a copy of the 'SPARSIM_estimate_parameter_from_data' function from the 'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo, Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed under the GPL-3 license.
A SPARSim simulation parameters
Function to estimate the variability values from the genes in data
.
sparsim_estimate_variability(data)
sparsim_estimate_variability(data)
data |
raw count data matrix (gene on rows, samples on columns) |
This function is a copy of the 'SPARSIM_estimate_variability' function from the
'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo,
Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed
under the GPL-3 license.
This function is used in sparsim_estimate_parameter_from_data
to compute SPARSim "variability" parameter, given a real count table as input.
If the count table contains more than one experimental condition, then the function is applied to each experimental conditions.
An array of variability values having N_genes
elements (N_genes = nrow(data)
)
This function is a copy of the 'SPARSIM_simulation' function from the 'SPARSim' package (v0.9.5), originally developed by Giacomo Baruzzo, Ilaria Patuzzi, Barbara Di Camillo (2020). The original package is licensed under the GPL-3 license.
sparsim_simulation( dataset_parameter, output_sim_param_matrices = FALSE, output_batch_matrix = FALSE, count_data_simulation_seed = NULL )
sparsim_simulation( dataset_parameter, output_sim_param_matrices = FALSE, output_batch_matrix = FALSE, count_data_simulation_seed = NULL )
dataset_parameter |
list containing, the intensity, variability and lib sizes of each experimental condition. It is the return value of "estimate_parameter_from_data" or could be created by the users |
output_sim_param_matrices |
boolean flag. If TRUE, the function will output two additional matrices, called abundance_matrix and variability_matrix, containing the gene intensities and gene variabilities used as simulation input. (Default: FALSE) |
output_batch_matrix |
boolean flag. If TRUE, the function will output an additional matrix, called batch_factors_matrix, containing the multiplicative factors used in batch effect simulation. (Default: FALSE) |
count_data_simulation_seed |
inherited from sparsim |
A list of 5 elements:
- count_matrix
: the simulated count matrix (genes on rows, samples on columns)
- gene_matrix
: the simulated gene expression levels (genes on rows, samples on columns)
- abundance_matrix
: the input gene intensity values provided as input (genes on rows, samples on columns), if output_sim_param_matrices
= TRUE. NULL otherwise.
- variability_matrix
: the input gene variability values provided as input (genes on rows, samples on columns), if output_sim_param_matrices
= TRUE. NULL otherwise.
- batch_factors_matrix
: the multiplicative factor used in batch generation (genes on rows, samples on columns), if output_batch_matrix
= TRUE. NULL otherwise.
Data to extract human TF
data("TF_human")
data("TF_human")
vector of gene names
gene names corresponding to TF and to Target genes
@source https://tflink.net/