Introducing mRNA bias into simulations with scaling factors

Using scaling factors

This package allows the user to introduce an mRNA bias into pseudo-bulk RNA-seq samples. Different cell-types contain different amounts of mRNA (Dendritic cells for examples contain much more than Neutrophils); this bias can be added into simulations artificially in different ways.
The scaling factors will always be applied on the single-cell dataset first, altering its expression profiles accordingly, and then the pseudo-bulk samples are generated by summing up the count data from the sampled cells.

# Example data
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell_", rep(1:300))
colnames(tpm) <- paste0("cell_", rep(1:300))
rownames(counts) <- paste0("gene_", rep(1:1000))
rownames(tpm) <- paste0("gene_", rep(1:1000))
annotation <- data.frame(
  "ID" = paste0("cell_", rep(1:300)),
  "cell_type" = c(rep("T cells CD4", 150), rep("T cells CD8", 150)),
  "spikes" = stats::runif(300),
  "add_1" = stats::runif(300),
  "add_2" = stats::runif(300)
)
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.

Pre-defined scaling factors

Some studies have proposed scaling factors for immune cells, such as EPIC (Racle et al. 2017) or quanTIseq (Finotello et al. 2019), deconvolution tools which correct for the mRNA bias internally using these values:

epic <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells",
    "T cells CD4",
    "T cells CD8",
    "T helper cells",
    "T regulatory cells",
    "otherCells",
    "default"
  ),
  mRNA = c(
    0.4016,
    1.4196,
    1.4196,
    0.1300,
    0.4396,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.3952,
    0.4000,
    0.4000
  )
)
epic
#>                  type   mRNA
#> 1             B cells 0.4016
#> 2         Macrophages 1.4196
#> 3           Monocytes 1.4196
#> 4         Neutrophils 0.1300
#> 5            NK cells 0.4396
#> 6             T cells 0.3952
#> 7         T cells CD4 0.3952
#> 8         T cells CD8 0.3952
#> 9      T helper cells 0.3952
#> 10 T regulatory cells 0.3952
#> 11         otherCells 0.4000
#> 12            default 0.4000
quantiseq <- data.frame(
  type = c(
    "B cells",
    "Macrophages",
    "MacrophagesM2",
    "Monocytes",
    "Neutrophils",
    "NK cells",
    "T cells CD4",
    "T cells CD8",
    "T regulatory cells",
    "Dendritic cells",
    "T cells"
  ),
  mRNA = c(
    65.66148,
    138.11520,
    119.35447,
    130.65455,
    27.73634,
    117.71584,
    63.87200,
    70.25659,
    72.55110,
    140.76091,
    68.89323
  )
)
quantiseq
#>                  type      mRNA
#> 1             B cells  65.66148
#> 2         Macrophages 138.11520
#> 3       MacrophagesM2 119.35447
#> 4           Monocytes 130.65455
#> 5         Neutrophils  27.73634
#> 6            NK cells 117.71584
#> 7         T cells CD4  63.87200
#> 8         T cells CD8  70.25659
#> 9  T regulatory cells  72.55110
#> 10    Dendritic cells 140.76091
#> 11            T cells  68.89323

When you want to apply one of these scaling factors into your simulation (therefore in-/decreasing the expression signals for the cell-types), we can use the scaling_factor parameter. Note, that these pre-defined scaling factors only offer values for a certain number of cell types, and your annotation in the provided dataset has to match these names 1:1. All cell types from your dataset which are not present in this scaling factor remain unscaled and a warning message will appear.

sim_epic <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "epic",
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Using EPIC scaling factors.
#> Finished simulation.

We can also try out some custom scaling factors, for example increasing the expression levels for a single cell-type (T cells CD8) by 10-fold compared to the rest. All cell-types which are not mentioned in the named list given to custom_scaling_vector will be transformed with a scaling factor of 1, meaning nothing changes for them.

sim_extreme_b <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "custom",
  custom_scaling_vector = c("T cells CD8" = 10),
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Using custom scaling factors.
#> Warning in merge_scaling_factor(data = data, scaling_factor_values =
#> custom_scaling_vector, : For some cell type(s) in the dataset, no scaling
#> factor is available when using custom: T cells CD4. This cell type will not be
#> re-scaled.
#> Finished simulation.

Important: Watch out that the cell-type annotation names in your dataset are the same as in the scaling factor! Otherwise the scaling factor will not be applied or even worse, applied to a different cell-type.

Dataset specific scaling factors

You can also choose to calculate scaling factors, which are depending on your provided single-cell dataset. Compared to the previous section, this will give a unique value for each cell rather than a cell-type, making it possibly more sensitive.

Reads and genes

Two straight forward approaches would be the number of reads or number of expressed genes/features. As these values are easily obtainable from the provided count data, SimBu already calculates them during dataset generation.

sim_reads <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "read_number", # use number of reads as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "expressed_genes", # use number of expressed genes column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.

These options would also allow you to use other numerical measurements you have for the single cells as scaling factors, such as weight or size for example. Lets pretend, add_1 and add_2 are such measurements. With the additional_cols parameter, they can be added to the SimBu dataset and we can use them as scaling factor as well:

utils::head(annotation)
#>       ID   cell_type     spikes     add_1      add_2
#> 1 cell_1 T cells CD4 0.20320152 0.6607777 0.34424450
#> 2 cell_2 T cells CD4 0.99456691 0.4105266 0.07620368
#> 3 cell_3 T cells CD4 0.02251722 0.6575505 0.04280217
#> 4 cell_4 T cells CD4 0.60889578 0.8029634 0.99499934
#> 5 cell_5 T cells CD4 0.07302276 0.6908290 0.80411374
#> 6 cell_6 T cells CD4 0.90595805 0.8725902 0.53176993
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  additional_cols = c("add_1", "add_2")
) # add columns to dataset
#> Filtering genes...
#> Created dataset.
sim_genes <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "add_1", # use add_1 column as scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Scaling by custom column in annotation table; if no scaling is wished instead, use 'NONE'.
#> Finished simulation.

Spike-ins

One other numerical measurement can be spike-ins. Usually the number of reads mapped to spike-in molecules per cell is given in the cell annotations. If this is the case, they can be stored in the dataset annotation using the spike_in_col parameter, where you indicate the name of the column from the annotation dataframe in which the spike-in information is stored. To calculate a scaling factor from this, the number of reads are also necessary, so we will add this information as well (as above using the read_number_col parameter).

The scaling factor with spike-ins is calculated as the “% of reads NOT mapped to spike-in reads”, or: (n_reads - n_spike_in)/n_reads for each cell. We apply it like this:

ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  name = "test_dataset",
  spike_in_col = "spikes"
) # give the name in the annotation file, that contains spike-in information
sim_spike <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "spike_in", # use spike-in scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)

Census - estimate mRNA counts per cell

Census is an approach which tries to convert TPM counts into relative transcript counts. This basically means, you get the mRNA counts per cell, which can differ between cell-types.
(Qiu et al. 2017) state in their paper, that it should only be applied to TPM/FPKM normalized data, but I tried it out with raw expression counts as well, which worked as well.
Census calculates a vector with a scaling value for each cell in a sample. You can switch this feature on, by setting the scaling_factor parameter to census.

ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  tpm_matrix = tpm,
  name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.
sim_census <- SimBu::simulate_bulk(
  data = ds,
  scenario = "random",
  scaling_factor = "census", # use census scaling factor
  nsamples = 10,
  ncells = 100,
  BPPARAM = BiocParallel::MulticoreParam(workers = 4),
  run_parallel = TRUE
)
#> Using parallel generation of simulations.
#> Finished simulation.

In our analysis we found, that Census is basically a complicated way of estimating the number of expressed genes per cell. It will remain to the user to decide if he/she wants to use census or simply the number of expressed genes (as shown above) as scaling factor.

utils::sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SimBu_1.9.0    rmarkdown_2.29
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.37.0 gtable_0.3.6               
#>  [3] xfun_0.49                   bslib_0.8.0                
#>  [5] ggplot2_3.5.1               Biobase_2.67.0             
#>  [7] lattice_0.22-6              vctrs_0.6.5                
#>  [9] tools_4.4.2                 generics_0.1.3             
#> [11] stats4_4.4.2                parallel_4.4.2             
#> [13] tibble_3.2.1                fansi_1.0.6                
#> [15] pkgconfig_2.0.3             Matrix_1.7-1               
#> [17] data.table_1.16.2           RColorBrewer_1.1-3         
#> [19] S4Vectors_0.45.2            sparseMatrixStats_1.19.0   
#> [21] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
#> [23] farver_2.1.2                compiler_4.4.2             
#> [25] munsell_0.5.1               codetools_0.2-20           
#> [27] GenomeInfoDb_1.43.2         htmltools_0.5.8.1          
#> [29] sys_3.4.3                   buildtools_1.0.0           
#> [31] sass_0.4.9                  yaml_2.3.10                
#> [33] pillar_1.9.0                crayon_1.5.3               
#> [35] jquerylib_0.1.4             tidyr_1.3.1                
#> [37] BiocParallel_1.41.0         DelayedArray_0.33.2        
#> [39] cachem_1.1.0                abind_1.4-8                
#> [41] tidyselect_1.2.1            digest_0.6.37              
#> [43] dplyr_1.1.4                 purrr_1.0.2                
#> [45] labeling_0.4.3              maketools_1.3.1            
#> [47] fastmap_1.2.0               grid_4.4.2                 
#> [49] colorspace_2.1-1            cli_3.6.3                  
#> [51] SparseArray_1.7.2           magrittr_2.0.3             
#> [53] S4Arrays_1.7.1              utf8_1.2.4                 
#> [55] withr_3.0.2                 UCSC.utils_1.3.0           
#> [57] scales_1.3.0                XVector_0.47.0             
#> [59] httr_1.4.7                  matrixStats_1.4.1          
#> [61] proxyC_0.4.1                evaluate_1.0.1             
#> [63] knitr_1.49                  GenomicRanges_1.59.1       
#> [65] IRanges_2.41.1              rlang_1.1.4                
#> [67] Rcpp_1.0.13-1               glue_1.8.0                 
#> [69] BiocGenerics_0.53.3         jsonlite_1.8.9             
#> [71] R6_2.5.1                    MatrixGenerics_1.19.0      
#> [73] zlibbioc_1.52.0

References

Finotello, Francesca, Clemens Mayer, Christina Plattner, Gerhard Laschober, Dietmar Rieder, Hubert Hackl, Anne Krogsdam, et al. 2019. “Molecular and Pharmacological Modulators of the Tumor Immune Contexture Revealed by Deconvolution of RNA-Seq Data.” Genome Medicine 11 (1): 34. https://doi.org/10.1186/s13073-019-0638-6.
Qiu, Xiaojie, Andrew Hill, Jonathan Packer, Dejun Lin, Yi-An Ma, and Cole Trapnell. 2017. “Single-Cell mRNA Quantification and Differential Analysis with Census.” Nature Methods 14 (3): 309–15. https://doi.org/10.1038/nmeth.4150.
Racle, Julien, Kaat de Jonge, Petra Baumgaertner, Daniel E. Speiser, and David Gfeller. 2017. “Simultaneous Enumeration of Cancer and Immune Cell Types from Bulk Tumor Gene Expression Data.” eLife 6 (November): e26476. https://doi.org/10.7554/eLife.26476.