countsimQC - Comparing characteristic features across count data sets

Introduction

The countsimQC package provides a simple way to compare the characteristic features of a collection of (e.g., RNA-seq) count data sets. An important application is in situations where a synthetic count data set has been generated using a real count data set as an underlying source of parameters, in which case it is often important to verify that the final synthetic data captures the main features of the original data set. However, the package can be used to create a visual overview of any collection of one or more count data sets.

Data preparation

In this vignette we will show how to generate a comparative report from a collection of two simulated data sets and the original, underlying real data set. First, we load the object containing the three data sets. The object is a named list, where each element is a DESeqDataSet object, containing the count matrix, a sample information data frame and a model formula (necessary to calculate dispersions). For more information about the DESeqDataSet class, please see the DESeq2 Bioconductor package. For speed reasons, we use only a subset of the features in each data set for the following calculations.

suppressPackageStartupMessages({
  library(countsimQC)
  library(DESeq2)
})

data(countsimExample)
countsimExample
## $Original
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): ENSMUSG00000000001.4 ENSMUSG00000000028.14 ...
##   ENSMUSG00000048027.7 ENSMUSG00000048029.10
## rowData names(0):
## colnames(11): GSM1923445 GSM1923446 ... GSM1923578 GSM1923579
## colData names(2): group sample
## 
## $Sim1
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(4): Cell Batch Group ExpLibSize
## 
## $Sim2
## class: DESeqDataSet 
## dim: 10000 11 
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(3): Cell CellFac Group
countsimExample <- lapply(countsimExample, function(cse) {
  cse[seq_len(1500), ]
})

Report generation

Next, we generate the report using the countsimQCReport() function. Depending on the level of detail and the type of information that are required for the final report, this function can be run in different “modes”:

  • by setting calculateStatistics = FALSE, only plots will be generated. This is the fastest way of running countsimQCReport(), and in many cases generates enough information for the user to make a visual evaluation of the count data set(s).
  • by setting calculateStatistics = TRUE and permutationPvalues = FALSE, some quantitative pairwise comparisons between data sets will be performed. In particular, the Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test will be used to compare distributions, and additional statistics will be calculated to evaluate how similar the evaluated aspects are between pairs of data sets.
  • by setting calculateStatistics = TRUE and permutationPvalues = TRUE (and giving the requested number of permutations via the nPermutations argument), permutation of data set labels will be used to evaluate the significance of the statistics calculated in the previous point. Naturally, this increases the run time of the analysis considerably.

Here, for the sake of speed, we calculate statistics for a small subset of the observations (subsampleSize = 25) and refrain from calculating permutation p-values.

tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample, outputFile = "countsim_report.html",
                 outputDir = tempDir, outputFormat = "html_document", 
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.", 
                 maxNForCorr = 25, maxNForDisp = Inf, 
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5, 
                 permutationPvalues = FALSE, nPermutations = NULL)

The countsimQCReport() function can generate either an HTML file (by setting outputFormat = "html_document" or outputFormat = NULL) or a pdf file (by setting outputFormat = "pdf_document"). The description argument can be used to provide a more extensive description of the data set(s) that are included in the report.

Generation of individual figures

If the argument savePlots is set to TRUE, an .rds file containing the individual ggplot objects will be generated. These objects can be used to perform fine-tuning of the visualizations if desired. Note, however, that the .rds file can become large if the number of data sets is large, or if the individual data sets have many samples or features. The convenience function generateIndividualPlots() can be used to quickly generate individual figures for all plots included in the report, using a variety of devices. For example, to generate each plot in pdf format:

ggplots <- readRDS(file.path(tempDir, "countsim_report_ggplots.rds"))
if (!dir.exists(file.path(tempDir, "figures"))) {
  dir.create(file.path(tempDir, "figures"))
}
generateIndividualPlots(ggplots, device = "pdf", nDatasets = 3, 
                        outputDir = file.path(tempDir, "figures"))
## `geom_smooth()` using formula = 'y ~ x'

Input data format

In the example above, all data sets were provided as DESeqDataSet objects. The advantage of this is that it allows the specification of the experimental design, which is used in the dispersion calculations. countsimQC also allows a data set to be provided as either a data.frame or a matrix. However, in these situations, it will be assumed that all samples are replicates (i.e., a design ~1). An example is provided in the countsimExample_dfmat data set, provided with the package.

data(countsimExample_dfmat)
names(countsimExample_dfmat)
## [1] "Original" "Sim1"     "Sim2"
lapply(countsimExample_dfmat, class)
## $Original
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
## 
## $Sim1
## [1] "matrix" "array" 
## 
## $Sim2
## [1] "data.frame"
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample_dfmat, 
                 outputFile = "countsim_report_dfmat.html",
                 outputDir = tempDir, outputFormat = "html_document", 
                 showCode = FALSE, forceOverwrite = TRUE,
                 savePlots = TRUE, description = "This is my test report.", 
                 maxNForCorr = 25, maxNForDisp = Inf, 
                 calculateStatistics = TRUE, subsampleSize = 25,
                 kfrac = 0.01, kmin = 5, 
                 permutationPvalues = FALSE, nPermutations = NULL)

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DESeq2_1.47.0               SummarizedExperiment_1.37.0
##  [3] Biobase_2.67.0              MatrixGenerics_1.19.0      
##  [5] matrixStats_1.4.1           GenomicRanges_1.59.0       
##  [7] GenomeInfoDb_1.43.0         IRanges_2.41.0             
##  [9] S4Vectors_0.45.0            BiocGenerics_0.53.1        
## [11] generics_0.1.3              countsimQC_1.25.0          
## [13] rmarkdown_2.28             
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        farver_2.1.2            dplyr_1.1.4            
##  [4] blob_1.2.4              bitops_1.0-9            Biostrings_2.75.0      
##  [7] fastmap_1.2.0           XML_3.99-0.17           digest_0.6.37          
## [10] lifecycle_1.0.4         survival_3.7-0          statmod_1.5.0          
## [13] KEGGREST_1.47.0         RSQLite_2.3.7           magrittr_2.0.3         
## [16] genefilter_1.89.0       compiler_4.4.1          rlang_1.1.4            
## [19] sass_0.4.9              tools_4.4.1             utf8_1.2.4             
## [22] yaml_2.3.10             knitr_1.48              labeling_0.4.3         
## [25] S4Arrays_1.7.1          htmlwidgets_1.6.4       bit_4.5.0              
## [28] DelayedArray_0.33.1     abind_1.4-8             BiocParallel_1.41.0    
## [31] withr_3.0.2             purrr_1.0.2             sys_3.4.3              
## [34] grid_4.4.1              fansi_1.0.6             caTools_1.18.3         
## [37] xtable_1.8-4            colorspace_2.1-1        edgeR_4.5.0            
## [40] ggplot2_3.5.1           scales_1.3.0            cli_3.6.3              
## [43] crayon_1.5.3            ragg_1.3.3              httr_1.4.7             
## [46] DBI_1.2.3               cachem_1.1.0            zlibbioc_1.52.0        
## [49] splines_4.4.1           parallel_4.4.1          AnnotationDbi_1.69.0   
## [52] XVector_0.47.0          vctrs_0.6.5             Matrix_1.7-1           
## [55] jsonlite_1.8.9          bit64_4.5.2             crosstalk_1.2.1        
## [58] systemfonts_1.1.0       maketools_1.3.1         locfit_1.5-9.10        
## [61] limma_3.63.0            tidyr_1.3.1             jquerylib_0.1.4        
## [64] annotate_1.85.0         glue_1.8.0              randtests_1.0.2        
## [67] codetools_0.2-20        DT_0.33                 gtable_0.3.6           
## [70] UCSC.utils_1.3.0        munsell_0.5.1           tibble_3.2.1           
## [73] pillar_1.9.0            htmltools_0.5.8.1       GenomeInfoDbData_1.2.13
## [76] R6_2.5.1                textshaping_0.4.0       evaluate_1.0.1         
## [79] lattice_0.22-6          highr_0.11              png_0.1-8              
## [82] memoise_2.0.1           bslib_0.8.0             Rcpp_1.0.13-1          
## [85] nlme_3.1-166            SparseArray_1.7.0       mgcv_1.9-1             
## [88] xfun_0.49               buildtools_1.0.0        pkgconfig_2.0.3