The countsimQC
package provides a simple way to compare
the characteristic features of a collection of (e.g., RNA-seq) count
data sets. An important application is in situations where a synthetic
count data set has been generated using a real count data set as an
underlying source of parameters, in which case it is often important to
verify that the final synthetic data captures the main features of the
original data set. However, the package can be used to create a visual
overview of any collection of one or more count data sets.
In this vignette we will show how to generate a comparative report
from a collection of two simulated data sets and the original,
underlying real data set. First, we load the object containing the three
data sets. The object is a named list, where each element is a
DESeqDataSet
object, containing the count matrix, a sample
information data frame and a model formula (necessary to calculate
dispersions). For more information about the DESeqDataSet
class, please see the DESeq2
Bioconductor package. For speed reasons, we use only a subset of the
features in each data set for the following calculations.
suppressPackageStartupMessages({
library(countsimQC)
library(DESeq2)
})
data(countsimExample)
countsimExample
## $Original
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): ENSMUSG00000000001.4 ENSMUSG00000000028.14 ...
## ENSMUSG00000048027.7 ENSMUSG00000048029.10
## rowData names(0):
## colnames(11): GSM1923445 GSM1923446 ... GSM1923578 GSM1923579
## colData names(2): group sample
##
## $Sim1
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(4): Cell Batch Group ExpLibSize
##
## $Sim2
## class: DESeqDataSet
## dim: 10000 11
## metadata(1): version
## assays(1): counts
## rownames(10000): Gene1 Gene2 ... Gene9999 Gene10000
## rowData names(0):
## colnames(11): Cell1 Cell2 ... Cell88 Cell89
## colData names(3): Cell CellFac Group
Next, we generate the report using the
countsimQCReport()
function. Depending on the level of
detail and the type of information that are required for the final
report, this function can be run in different “modes”:
calculateStatistics = FALSE
, only plots will
be generated. This is the fastest way of running
countsimQCReport()
, and in many cases generates enough
information for the user to make a visual evaluation of the count data
set(s).calculateStatistics = TRUE
and
permutationPvalues = FALSE
, some quantitative pairwise
comparisons between data sets will be performed. In particular, the
Kolmogorov-Smirnov test and the Wald-Wolfowitz runs test will be used to
compare distributions, and additional statistics will be calculated to
evaluate how similar the evaluated aspects are between pairs of data
sets.calculateStatistics = TRUE
and
permutationPvalues = TRUE
(and giving the requested number
of permutations via the nPermutations
argument),
permutation of data set labels will be used to evaluate the significance
of the statistics calculated in the previous point. Naturally, this
increases the run time of the analysis considerably.Here, for the sake of speed, we calculate statistics for a small
subset of the observations (subsampleSize = 25
) and refrain
from calculating permutation p-values.
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample, outputFile = "countsim_report.html",
outputDir = tempDir, outputFormat = "html_document",
showCode = FALSE, forceOverwrite = TRUE,
savePlots = TRUE, description = "This is my test report.",
maxNForCorr = 25, maxNForDisp = Inf,
calculateStatistics = TRUE, subsampleSize = 25,
kfrac = 0.01, kmin = 5,
permutationPvalues = FALSE, nPermutations = NULL)
The countsimQCReport()
function can generate either an
HTML file (by setting outputFormat = "html_document"
or
outputFormat = NULL
) or a pdf file (by setting
outputFormat = "pdf_document"
). The
description
argument can be used to provide a more
extensive description of the data set(s) that are included in the
report.
If the argument savePlots
is set to TRUE, an .rds file
containing the individual ggplot objects will be generated. These
objects can be used to perform fine-tuning of the visualizations if
desired. Note, however, that the .rds file can become large if the
number of data sets is large, or if the individual data sets have many
samples or features. The convenience function
generateIndividualPlots()
can be used to quickly generate
individual figures for all plots included in the report, using a variety
of devices. For example, to generate each plot in pdf format:
ggplots <- readRDS(file.path(tempDir, "countsim_report_ggplots.rds"))
if (!dir.exists(file.path(tempDir, "figures"))) {
dir.create(file.path(tempDir, "figures"))
}
generateIndividualPlots(ggplots, device = "pdf", nDatasets = 3,
outputDir = file.path(tempDir, "figures"))
## `geom_smooth()` using formula = 'y ~ x'
In the example above, all data sets were provided as
DESeqDataSet
objects. The advantage of this is that it
allows the specification of the experimental design, which is used in
the dispersion calculations. countsimQC
also allows a data
set to be provided as either a data.frame
or a
matrix
. However, in these situations, it will be assumed
that all samples are replicates (i.e., a design ~1
). An
example is provided in the countsimExample_dfmat
data set,
provided with the package.
## [1] "Original" "Sim1" "Sim2"
## $Original
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
##
## $Sim1
## [1] "matrix" "array"
##
## $Sim2
## [1] "data.frame"
tempDir <- tempdir()
countsimQCReport(ddsList = countsimExample_dfmat,
outputFile = "countsim_report_dfmat.html",
outputDir = tempDir, outputFormat = "html_document",
showCode = FALSE, forceOverwrite = TRUE,
savePlots = TRUE, description = "This is my test report.",
maxNForCorr = 25, maxNForDisp = Inf,
calculateStatistics = TRUE, subsampleSize = 25,
kfrac = 0.01, kmin = 5,
permutationPvalues = FALSE, nPermutations = NULL)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DESeq2_1.47.1 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 MatrixGenerics_1.19.0
## [5] matrixStats_1.4.1 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [9] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [11] generics_0.1.3 countsimQC_1.25.0
## [13] rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
## [4] blob_1.2.4 bitops_1.0-9 Biostrings_2.75.3
## [7] fastmap_1.2.0 XML_3.99-0.17 digest_0.6.37
## [10] lifecycle_1.0.4 survival_3.8-3 statmod_1.5.0
## [13] KEGGREST_1.47.0 RSQLite_2.3.9 magrittr_2.0.3
## [16] genefilter_1.89.0 compiler_4.4.2 rlang_1.1.4
## [19] sass_0.4.9 tools_4.4.2 yaml_2.3.10
## [22] knitr_1.49 labeling_0.4.3 S4Arrays_1.7.1
## [25] htmlwidgets_1.6.4 bit_4.5.0.1 DelayedArray_0.33.3
## [28] abind_1.4-8 BiocParallel_1.41.0 withr_3.0.2
## [31] purrr_1.0.2 sys_3.4.3 grid_4.4.2
## [34] caTools_1.18.3 xtable_1.8-4 colorspace_2.1-1
## [37] edgeR_4.5.1 ggplot2_3.5.1 scales_1.3.0
## [40] cli_3.6.3 crayon_1.5.3 ragg_1.3.3
## [43] httr_1.4.7 DBI_1.2.3 cachem_1.1.0
## [46] splines_4.4.2 parallel_4.4.2 AnnotationDbi_1.69.0
## [49] XVector_0.47.1 vctrs_0.6.5 Matrix_1.7-1
## [52] jsonlite_1.8.9 bit64_4.5.2 crosstalk_1.2.1
## [55] systemfonts_1.1.0 maketools_1.3.1 locfit_1.5-9.10
## [58] limma_3.63.2 jquerylib_0.1.4 tidyr_1.3.1
## [61] annotate_1.85.0 glue_1.8.0 randtests_1.0.2
## [64] codetools_0.2-20 DT_0.33 gtable_0.3.6
## [67] UCSC.utils_1.3.0 munsell_0.5.1 tibble_3.2.1
## [70] pillar_1.10.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [73] R6_2.5.1 textshaping_0.4.1 evaluate_1.0.1
## [76] lattice_0.22-6 png_0.1-8 memoise_2.0.1
## [79] bslib_0.8.0 Rcpp_1.0.13-1 nlme_3.1-166
## [82] SparseArray_1.7.2 mgcv_1.9-1 xfun_0.49
## [85] buildtools_1.0.0 pkgconfig_2.0.3