Rqc is an optimized tool designed for quality control and assessment of high-throughput sequencing data. It performs parallel processing of entire files and produces a report which contains a set of high-resolution graphics that can be used for quality assessment.
This version of Rqc produces high-quality images for the following statistics:
The main goal of Rqc is to provide graphical tools for quality
assessment of reads contained in FASTQ files. This package is designed
focusing on simplicity of use. Therefore, the Rqc package allows the
user to call one single function called rqc
. The
rqc
method processes a set of input files and generates an
HTML report containing several plots that can be used for quality
assessment.
To access this functionality, the user needs to load Rqc package.
The next step is to determine the location of the FASTQ files that should be analyzed. The example below, uses sample files provided by the ShortRead package, but the user must modify this location accordingly, in order to reflect the actual location of the files that need QA.
The basic usage of the rqc
function requires the
definition of 2 arguments. One, path
, is the location where
the files of interest are saved at (this was defined on the step above).
The other argument, pattern
, is a regular expression that
identifies all files of interest. Below, we use .fastq.gz
to specify that all files containing that string are to be
processed.
At this point, the user’s default Internet browser will open an HTML file. This file is the report generated by Rqc, which, by default, is stored in a temporary directory. A sample report is shown below:
This table describes input files. reads
column can be
total number of reads (sample=FALSE
) or sample size.
filename | pair | format | group | reads | total.reads | path |
---|---|---|---|---|---|---|
ERR127302_1_subset.fastq.gz | 1 | FASTQ | None | 20000 | 20000 | /github/workspace/pkglib/ShortRead/extdata/E-MTAB-1147 |
ERR127302_2_subset.fastq.gz | 2 | FASTQ | None | 20000 | 20000 | /github/workspace/pkglib/ShortRead/extdata/E-MTAB-1147 |
This plot describe an overview of per read mean quality distribution of all files
This plot describes the average quality pattern by showing on the X-axis quality thresholds and on the Y-axis the percentage of reads that exceed that quality level.
This describes the average quality scores for each cycle of sequencing.
This plot shows the proportion of reads that appeared many times.
This heatmap plot shows dstance matrix between top represented reads. This functon only works with one result file (and not a list).
Barplot that presents the distribuition of the lengths of the reads available in the FASTQ file.
Line plot showing the average GC content for every cycle of sequencing.
Bar plot showing the proportion of quality calls per cycle. Colors are presented in a gradient Red-Blue, where red identifies calls of lower quality. This visualization is preferred as it is cleaner than the boxplots described below.
Biplot from Principal Component Analysis (PCA) of cycle-specific read average quality.
## Warning: `geom_hline()`: Ignoring `mapping` because `yintercept` was provided.
## Warning: `geom_vline()`: Ignoring `mapping` because `xintercept` was provided.
Boxplots describing empirical patterns of quality distribution on each cycle of sequencing.
This bar plot describes the proportion of each nucleotide called for every cycle of sequencing.
The line plot shows a more detailed view.
It is important to note that the rqc
function samples 1
million records from the FASTQ files. This can be set by adjusting the
n
argument for this function. If the user desires to have
the file processed as a whole (rather than sampling records from it),
(s)he must set the argument sample
to
FALSE
.
The rqc
function wraps a set of functions to generate a
quick report that can be used for quality assessment. However, users can
perform a step-by-step analysis by using the information described
below.
If one wants to process a set of files that are not located at the
same directory, the users needs to create a vector containing the
absolute path of files. The list.files
function can be
useful.
fastqDir <- system.file(package="ShortRead", "extdata/E-MTAB-1147")
files <- list.files(fastqDir, "fastq.gz", full.names=TRUE)
The example input files are samples from a public data set. These samples are available through the ShortRead package. More information regarding these data can be found on the vignette of that package.
To process the files without generating an HTML report, the user
should use rqcQA
function instead rqc
. This
function receives a vector containing the paths of the input files.
The rqcQA
function returns a list that contains the
required information to create the plots present on the standard HTML
report. Actually, both rqc
and rqcQA
return a
named list of RqcResultSet
objects. This output can be used
directly as input to other Rqc package functions. Examples of functions
that can use these objects are rqcReport
and
plot
. RqcResultSet
is a class that extends
.QA
class of ShortRead package. An
RqcResultSet
object contains information in two different
perspectives: read-specific data and cycle-specific
data. They can be accessed using [[
brackets.
To create a final HTML report, the user can apply the
rqcReport
function to the qa
object.
The report generated by rqcReport
contains the all plots
described on the beginning of this document. By default, it is written
to a temporary directory. This behavior can by modified by setting the
outdir
argument.
The Rqc package performs parallel processing of the samples by
interfacing with the BiocParallel
package. By default, Rqc calls multicoreWorkers
function
that returns the maximum number of workers available. You can change the
number of cores setting workers
parameter on
rqc
and rqcQA
functions. It is possible to
process the input files serially by setting workers=1
.
For each plot generated by Rqc, there is a function that shapes the data appropriately. The shaped information is then used to produce the final plot. The example below shows how the user can access these data to generate plots using other tools.
df <- rqcCycleAverageQualityCalc(qa)
cycle <- as.numeric(levels(df$cycle))[df$cycle]
plot(cycle, df$quality, col = df$filename, xlab='Cycle', ylab='Quality Score')
One can also process a subset of result data using subsets of a list.
Rqc package accepts R Markdown file format as template file for generating custom reports. Markdown is a markup language for web development. R Markdown files are regular Markdown files with R codes. Every chunk of code is executed during compilation done by knitr package. Knitr takes R Markdown file and generates Markdown file merged with R codes and their outputs such as text, tables and figures. Result Markdown file is used by Rqc for generating final report in HTML or PDF formats. The source file of Rqc’s default report is a good reference for writing new template reports. Run code below returns the system file path to this source file.
Basic concepts for writing Markdown documents with R code are available at http://rmarkdown.rstudio.com/. Advanced concepts about the compilation process of R Markdown files are available at http://yihui.name/knitr/. The Bioconductor project provides style guidelines for HTML documents at http://www.bioconductor.org/packages/release/bioc/vignettes/BiocStyle/inst/doc/HtmlStyle.html.
Rqc result data is available inside template files through
rqcResultSet
object. This object is a list of summarized
statistics about input files and it is used by all accessor functions
and plots provided by Rqc package. The rqcReport
function
takes template file path as argument and generates personalized
reports.
The Rqc package provides a simple interface to generate plots often used for quality assessment of high-throughput sequencing data. It uses the standard Bioconductor parallelization framework to add efficiency to data processing. The images produced by the package are high-quality figures that can be directly used on publications.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Rqc_1.41.0 ggplot2_3.5.1
## [3] ShortRead_1.65.0 GenomicAlignments_1.43.0
## [5] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [7] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [9] Rsamtools_2.23.0 GenomicRanges_1.59.1
## [11] Biostrings_2.75.1 GenomeInfoDb_1.43.1
## [13] XVector_0.47.0 IRanges_2.41.1
## [15] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [17] generics_0.1.3 BiocParallel_1.41.0
## [19] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 rstudioapi_0.17.1
## [4] jsonlite_1.8.9 magrittr_2.0.3 GenomicFeatures_1.59.1
## [7] farver_2.1.2 rmarkdown_2.29 BiocIO_1.17.0
## [10] zlibbioc_1.52.0 vctrs_0.6.5 memoise_2.0.1
## [13] RCurl_1.98-1.16 base64enc_0.1-3 htmltools_0.5.8.1
## [16] S4Arrays_1.7.1 curl_6.0.1 SparseArray_1.7.2
## [19] Formula_1.2-5 sass_0.4.9 bslib_0.8.0
## [22] htmlwidgets_1.6.4 plyr_1.8.9 cachem_1.1.0
## [25] buildtools_1.0.0 mime_0.12 lifecycle_1.0.4
## [28] pkgconfig_2.0.3 Matrix_1.7-1 R6_2.5.1
## [31] fastmap_1.2.0 GenomeInfoDbData_1.2.13 shiny_1.9.1
## [34] digest_0.6.37 colorspace_2.1-1 AnnotationDbi_1.69.0
## [37] Hmisc_5.2-0 RSQLite_2.3.8 hwriter_1.3.2.1
## [40] labeling_0.4.3 fansi_1.0.6 httr_1.4.7
## [43] abind_1.4-8 compiler_4.4.2 bit64_4.5.2
## [46] withr_3.0.2 htmlTable_2.4.3 backports_1.5.0
## [49] DBI_1.2.3 DelayedArray_0.33.2 rjson_0.2.23
## [52] tools_4.4.2 foreign_0.8-87 httpuv_1.6.15
## [55] nnet_7.3-19 glue_1.8.0 restfulr_0.0.15
## [58] promises_1.3.0 grid_4.4.2 checkmate_2.3.2
## [61] cluster_2.1.6 reshape2_1.4.4 gtable_0.3.6
## [64] BSgenome_1.75.0 ensembldb_2.31.0 data.table_1.16.2
## [67] utf8_1.2.4 pillar_1.9.0 markdown_1.13
## [70] stringr_1.5.1 later_1.3.2 lattice_0.22-6
## [73] rtracklayer_1.67.0 bit_4.5.0 deldir_2.0-4
## [76] biovizBase_1.55.0 maketools_1.3.1 knitr_1.49
## [79] gridExtra_2.3 ProtGenerics_1.39.0 xfun_0.49
## [82] stringi_1.8.4 UCSC.utils_1.3.0 lazyeval_0.2.2
## [85] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
## [88] interp_1.1-6 GenomicFiles_1.43.0 tibble_3.2.1
## [91] BiocManager_1.30.25 cli_3.6.3 rpart_4.1.23
## [94] xtable_1.8-4 munsell_0.5.1 jquerylib_0.1.4
## [97] dichromat_2.0-0.1 Rcpp_1.0.13-1 png_0.1-8
## [100] XML_3.99-0.17 parallel_4.4.2 blob_1.2.4
## [103] latticeExtra_0.6-30 jpeg_0.1-10 AnnotationFilter_1.31.0
## [106] bitops_1.0-9 pwalign_1.3.0 VariantAnnotation_1.53.0
## [109] scales_1.3.0 crayon_1.5.3 rlang_1.1.4
## [112] KEGGREST_1.47.0