The package ngsReports
is designed to bolt into data
processing pipelines and produce combined plots for multiple FastQC
reports generated across an entire set of libraries or samples. The
primary functionality of the package is parsing FastQC reports, with
import methods also implemented for log files produced by tools as as
STAR
, hisat2
and others. In addition to
parsing files, default plotting methods are implemented. Plots applied
to a single file will replicate the default plots from
FastQC
1, whilst methods applied to multiple FastQC
reports summarise these and produce a series of custom plots.
Plots are produced as standard ggplot2 objects, with an interactive option available using plotly. As well as custom summary plots, tables of read counts and the like can also be easily generated.
In addition to the usage demonstrated below, a shiny
app
has been developed for interactive viewing of FastQC reports. This can
be installed using:
remotes::install_github("UofABioinformaticsHub/shinyNgsreports")
A vignette for this app will be installed with the
shinyNgsreports
package.
In it’s simplest form, a default summary report can be generated
simply by specifying a directory containing the output from FastQC and
calling the function writeHtmlReport()
.
fileDir <- file.path("path", "to", "your", "FastQC", "Reports")
writeHtmlReport(fileDir)
This function will transfer the default template to the provided
directory and produce a single .html
file containing
interactive summary plots of any FastQC output found in the directory.
FastQC output can be *fastqc.zip
files or the same files
extracted as individual directories.
The default template is provided as
ngsReports_Fastqc.Rmd
in the package directory . This
template can be easily modified and supplied as an alternate template to
the above function using your modified file as the template RMarkdown
file.
altTemplate <- file.path("path", "to", "your", "new", "template.Rmd")
writeHtmlReport(fileDir, template = altTemplate)
The package ngsReports
introduces two main
S4
classes:
FastqcData
& FastqcDataList
FastqcData
objects hold the parsed data from a
single report as generated by the stand-alone tool
FastQC
. These are then extended into lists for more
than one file as a FastqcDataList
. For most users, the
primary class of interest will be the FastqcDataList
.
R
To load a set of FastQC
reports into R
as a
FastqcDataList
, specify the vector of file paths, then call
the function FastqcDataList()
. In the rare case you’d like
an individual file, this can be performed by calling
FastqcData()
on an individual file, or subsetting the
output from FastqcDataList()
using the [[]]
operator as with any list object.
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)
From here, all FastQC modules can be obtained as a
tibble
(i.e. data.frame
) using the function
getModule()
and choosing one of the following modules:
Summary
(The PASS/WARN/FAIL status for each
module)Basic_Statistics
Per_base_sequence_quality
Per_sequence_quality_scores
Per_base_sequence_content
Per_sequence_GC_content
Per_base_N_content
Sequence_Length_Distribution
Sequence_Duplication_Levels
Overrepresented_sequences
Adapter_Content
Kmer_Content
Per_tile_sequence_quality
## # A tibble: 12 × 3
## Filename Status Category
## <chr> <chr> <chr>
## 1 ATTG_R1.fastq PASS Basic Statistics
## 2 ATTG_R1.fastq FAIL Per base sequence quality
## 3 ATTG_R1.fastq WARN Per tile sequence quality
## 4 ATTG_R1.fastq PASS Per sequence quality scores
## 5 ATTG_R1.fastq FAIL Per base sequence content
## 6 ATTG_R1.fastq FAIL Per sequence GC content
## 7 ATTG_R1.fastq PASS Per base N content
## 8 ATTG_R1.fastq PASS Sequence Length Distribution
## 9 ATTG_R1.fastq FAIL Sequence Duplication Levels
## 10 ATTG_R1.fastq FAIL Overrepresented sequences
## 11 ATTG_R1.fastq FAIL Adapter Content
## 12 ATTG_R1.fastq FAIL Kmer Content
Capitalisation and spelling of these module names follows the default patterns from FastQC reports with spaces replaced by underscores. One additional module is available and taken directly from the text within the supplied reports
Total_Duplicated_Percentage
In addition, the read totals for each file in the library can be
obtained using readTotals()
, which can be easily used to
make a table of read totals. This essentially just returns the first two
columns from getModule(x, "Basic_Statistics")
.
The packages dplyr
and pander
can also be
extremely useful for manipulating and displaying imported data. To show
only the R1 read totals, you could do the following
library(dplyr)
library(pander)
reads %>%
dplyr::filter(grepl("R1", Filename)) %>%
pander(
big.mark = ",",
caption = "Read totals from R1 libraries",
justify = "lr"
)
Filename | Total_Sequences |
---|---|
ATTG_R1.fastq | 24,978 |
CCGC_R1.fastq | 22,269 |
GACC_R1.fastq | 10,287 |
Plots created from a single FastqcData
object will
resemble those generated by the FastQC
tool, whilst those
created from a FastqcDataList
will be combined summaries
across a library of files. In addition, all plots are able to be
generated as interactive plots using the argument
usePlotly = TRUE
.
All FastQC modules have been enabled for plotting using default
S4
dispatch, with the exception of
Per_tile_sequence_quality
.
The simplest of the plots is to summarise the
PASS/WARN/FAIL
flags as produced by FastQC
for
each module. This plot can be simply generated using
plotSummary()
The next most informative plot may be to summarise the total numbers
of reads in each associated Fastq file. By default, the number of
duplicated sequences from the Total_Duplicated_Percentage
module are shown, but this can be disabled by setting
duplicated = FALSE
.
As these are ggplot2
objects, the output can be modified
easily using conventional ggplot2
syntax. Here we’ll move
the legend to the top right as an example.
Turning to the Per base sequence quality
scores is the
next most common step for most researchers, and these can be obtained
for an individual file by selecting this as an element
(i.e. FastqcData
object ) of the main
FastqcDataList
object. This plot replicates the default
plots from a FastQC report.
When working with multiple FastQC reports, these are summarised as a heatmap using the mean quality score at each position.
Boxplots of any combinations can also be drawn from a
FastqcDataList
by setting the argument
plotType = "boxplot"
. However, this may be not suitable for
datasets with a large number of libraries.
Similarly, the Mean Sequence Quality Per Read plot can be generated
to replicate plots from a FastQC report by selecting the individual file
from the FastqcDataList
object.
A heatmap of mean sequence qualities can be generated when inspecting multiple reports.
An alternative view may be to plot these as overlaid lines, which can
be simply done by setting plotType = "line"
. Again,
discretion should be shown when choosing this option for many
samples.
The Per_base_sequence_content
module can also be plotted
for an individual report with the layout being identical to that from
FastQC.
These are then combined across multiple files as a heatmap showing a
composite colour for each position. Colours are combined using
rgb()
with A
,C
, G
and T
being represented by green, blue, black and red
respectively.
NB These plots can be very informative setting the
argument usePlotly = TRUE
, however they can be slow to
render given the nature of how the package plotly
renders
graphics.
Again, supplying multiple files and setting
plotType = "line"
will replicate multiple individual plots
from the original FastQC reports. The argument nc
is passed
to facet_wrap()
from the package ggplot2
to
determine the number of columns in the final plot.
Adapter content as identified by FastQC is also able to be plotted for an individual file.
When producing a heatmap across a set of FastQC reports, this will default to Total Adapter Content, instead of showing the individual adapter types.
As with all modules, the Sequence Duplication Levels plot is able to be replicated for an individual file.
When plotting across multiple FastQC reports, duplication levels are
shown as a heatmap based on each default bin included in the initial
FastQC reports. By default, the plotted values are the “Pre”
de-duplication values. Note that values are converted to percentages
instead of read numbers to ensure comparability across files. In the
plot below, CCGC_R2
shows very low duplication levels,
whilst ATTG_R1
shows high levels of duplication. The
commonly observed ‘spikes’ around >10
are also evident
as the larger red blocks.
A selection of Theoretical GC content is supplied with the package in
the object gcTheoretical
, which has been defined with the
additional S4
class TheoreticalGC
. GC content
was calculated using scripts obtained from
https://github.com/mikelove/fastqcTheoreticalGC.
Available genomes and transcriptomes can be obtained using the function
gcAvail()
on the object gcTheoretical
and
specifying the type.
## # A tibble: 24 × 4
## Name Group Source Version
## <chr> <chr> <chr> <chr>
## 1 Alyrata Plants JGI V1.0
## 2 Amellifera Animals UCSC apiMel2
## 3 Athaliana Plants TAIR Arapot11
## 4 Btaurus Animals UCSC bosTau8
## 5 Celegans Animals UCSC ce11
## 6 Cfamiliaris Animals UCSC canFam3
## 7 Dmelanogaster Animals UCSC dm6
## 8 Drerio Animals UCSC danRer10
## 9 Ecoli Bacteria NCBI 2008/08/05
## 10 Gaculeatus Other UCSC gasAcu1
## # ℹ 14 more rows
As with all modules, data for an individual file replicates the
default plot from a FastQC report, but with one key difference. This is
that the Theoretical GC content has been provided in the object
gcTheoretical
based on 100bp reads. This empirically
determined content is shown as the Theoretical GC content line.
Again, data is summarised as a heatmap when plotting across multiple reports, with the default value being the difference between the observed and the theoretical GC content.
Line plots can also be produce an alternative viewpoint, with read totals displayed as percentages instead of raw counts.
Customized theoretical GC content can be generated using input DNA sequences from a supplied fasta file.
When inspecting the Overrepresented Sequence module, the top
n
can be plotted for an individual file, again broken down
by their possible source, and coloured based on their
WARN/FAIL
status.
When applying this across multiple files, instead of identifying common sequences across a set of libraries, overrepresented sequences are summarised by their possible source as defined by FastQC.
In addition to the above, the most abundant n
overrepresented sequences can be exported as a FASTA file for easy
submission to blast
.
A selection of log files as produced by tools such as
STAR
, hisat2
, bowtie
,
bowtie2
and picard duplicationMetrics
, can be
easily imported using the importNgsLogs()
function.
Tool can be specified by the user using the argument
type
, however if no type
is provided we will
attempt to auto-detect from the file’s structure. Note: only a single
log file type can be imported at any time.
The importNgsLogs()
function currently supports log
files from the following tools.
Adapter removal and trimming
Mapping and alignment
Transcript/gene quantification
Genome assembly
fl <- c("Sample1.trimmomaticPE.txt")
trimmomaticLogs <- system.file("extdata", fl, package = "ngsReports")
df <- importNgsLogs(trimmomaticLogs)
df %>%
dplyr::select("Filename", contains("Surviving"), "Dropped") %>%
pander(
split.tables = Inf,
style = "rmarkdown",
big.mark = ",",
caption = "Select columns as an example of output from trimmomatic."
)
Filename | Both_Surviving | Forward_Only_Surviving | Reverse_Only_Surviving | Dropped |
---|---|---|---|---|
Sample1.trimmomaticPE.txt | 38243521 | 2791713 | 1464192 | 1663808 |
Bowtie log files can be parsed and imported
fls <- c("bowtiePE.txt", "bowtieSE.txt")
bowtieLogs <- system.file("extdata", fls, package = "ngsReports")
df <- importNgsLogs(bowtieLogs, type = "bowtie")
df %>%
dplyr::select("Filename", starts_with("Reads")) %>%
pander(
split.tables = Inf,
style = "rmarkdown",
big.mark = ",",
caption = "Select columns as an example of output from bowtie."
)
Filename | Reads_Processed | Reads_With_At_Least_One_Reported_Alignment | Reads_That_Failed_To_Align | Reads_With_Alignments_Suppressed_Due_To_-m |
---|---|---|---|---|
bowtiePE.txt | 42,867,915 | 21,891,058 | 20,748,783 | 228,074 |
bowtieSE.txt | 440,372 | 193,962 | 246,410 | NA |
STAR log files can be parsed and imported
starLog <- system.file("extdata", "log.final.out", package = "ngsReports")
df <- importNgsLogs(starLog, type = "star")
Filename | Uniquely_Mapped_Reads_Number | Uniquely_Mapped_Reads_Percent |
---|---|---|
log.final.out | 68,471,490 | 85.84 |
The output of the samtools flagstat
module can be parsed
and imported
flagstatLog <- system.file("extdata", "flagstat.txt", package = "ngsReports")
df <- importNgsLogs(flagstatLog, type = "flagstat")
Filename | QC-passed | QC-failed | flag |
---|---|---|---|
flagstat.txt | 67,405,447 | 0 | in total |
flagstat.txt | 5,275,834 | 0 | secondary |
flagstat.txt | 0 | 0 | supplementary |
flagstat.txt | 0 | 0 | duplicates |
flagstat.txt | 67,405,447 | 0 | mapped |
flagstat.txt | 62,129,613 | 0 | paired in sequencing |
flagstat.txt | 31,065,014 | 0 | read1 |
flagstat.txt | 31,064,599 | 0 | read2 |
flagstat.txt | 62,128,918 | 0 | properly paired |
flagstat.txt | 62,128,918 | 0 | with itself and mate mapped |
flagstat.txt | 695 | 0 | singletons |
flagstat.txt | 0 | 0 | with mate mapped to a different chr |
flagstat.txt | 0 | 0 | with mate mapped to a different chr (mapQ>=5) |
In addition to the files produced by the above alignment tools, the
output from Duplication Metrics (picard
) can also be
imported. This is imported as a list with a tibble
containing the detailed output in the list element $metrics
and the histogram data included as the second elements.
sysDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(sysDir, "Dedup_metrics.txt", full.names = TRUE)
dupMetrics <- importNgsLogs(fl, type = "duplicationMetrics", which = "metrics")
str(dupMetrics)
## tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
## $ Library : chr "~/myExpt/2_alignedData/bams/Sample1.bam"
## $ Unpaired Reads Examined : int 93
## $ Read Pairs Examined : int 43490914
## $ Secondary Or Supplementary Rds: int 5366030
## $ Unmapped Reads : int 0
## $ Unpaired Read Duplicates : int 79
## $ Read Pair Duplicates : int 11494386
## $ Read Pair Optical Duplicates : int 642648
## $ Percent Duplication : num 0.264
## $ Estimated Library Size : int 69609522
Assembly ‘completeness’ and summary statistic information from the
tools BUSCO and quast can also be plot using the function
plotAssemblyStats()
buscoLog <- system.file("extdata", "short_summary_Dmelanogaster_Busco.txt", package = "ngsReports")
plotAssemblyStats(buscoLog, type = "busco")
fls <- c("quast1.tsv", "quast2.tsv")
quastLog <- system.file("extdata", fls, package = "ngsReports")
plotAssemblyStats(quastLog, type = "quast")
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pander_0.6.5 dplyr_1.1.4 ngsReports_2.9.0
## [4] tibble_3.2.1 patchwork_1.3.0 ggplot2_3.5.1
## [7] BiocGenerics_0.53.2 generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
## [4] htmlwidgets_1.6.4 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.2 stats4_4.4.2 fansi_1.0.6
## [10] pkgconfig_2.0.3 data.table_1.16.2 RColorBrewer_1.1-3
## [13] S4Vectors_0.45.1 lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [16] compiler_4.4.2 farver_2.1.2 stringr_1.5.1
## [19] Biostrings_2.75.1 munsell_0.5.1 GenomeInfoDb_1.43.0
## [22] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0
## [25] sass_0.4.9 yaml_2.3.10 lazyeval_0.2.2
## [28] plotly_4.10.4 pillar_1.9.0 crayon_1.5.3
## [31] jquerylib_0.1.4 tidyr_1.3.1 MASS_7.3-61
## [34] cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.4 reshape2_1.4.4 purrr_1.0.2
## [40] labeling_0.4.3 maketools_1.3.1 forcats_1.0.0
## [43] fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
## [46] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
## [49] withr_3.0.2 scales_1.3.0 UCSC.utils_1.3.0
## [52] lubridate_1.9.3 timechange_0.3.0 rmarkdown_2.29
## [55] XVector_0.47.0 httr_1.4.7 zoo_1.8-12
## [58] evaluate_1.0.1 knitr_1.49 IRanges_2.41.0
## [61] viridisLite_0.4.2 rlang_1.1.4 ggdendro_0.2.0
## [64] Rcpp_1.0.13-1 glue_1.8.0 BiocManager_1.30.25
## [67] jsonlite_1.8.9 plyr_1.8.9 R6_2.5.1
## [70] zlibbioc_1.52.0