Package: qckitfastq 1.23.0

August Guang

qckitfastq: FASTQ Quality Control

Assessment of FASTQ file format with multiple metrics including quality score, sequence content, overrepresented sequence and Kmers.

Authors: Wenyue Xing [aut], August Guang [aut, cre]

# Install 'qckitfastq' in R:

install.packages('qckitfastq', repos = c('https://bioc.r-universe.dev', 'https://cloud.r-project.org'))

Uses libs:

zlib – Compression library
c++ – GNU Standard C++ Library v3

On Bioconductor: qckitfastq-1.23.0 (bioc 3.21), qckitfastq-1.22.0 (bioc 3.20)

This package does not link to any Github/Gitlab/R-forge repository. No issue tracker or development information is available.

software qualitycontrol sequencing zlib cpp

4.38 score 24 scripts 267 downloads 26 exports 40 dependencies

Last updated 5 months ago from: 6d6381dbe6. Checks: 1 OK, 8 WARNING, 3 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 19 2025
R-4.5-win-x86_64	NOTE	Mar 19 2025
R-4.5-mac-x86_64	WARNING	Mar 19 2025
R-4.5-mac-aarch64	WARNING	Mar 19 2025
R-4.5-linux-x86_64	WARNING	Mar 19 2025
R-4.4-win-x86_64	NOTE	Mar 19 2025
R-4.4-mac-x86_64	WARNING	Mar 19 2025
R-4.4-mac-aarch64	WARNING	Mar 19 2025
R-4.4-linux-x86_64	WARNING	Mar 19 2025
R-4.3-win-x86_64	NOTE	Mar 19 2025
R-4.3-mac-x86_64	WARNING	Mar 19 2025
R-4.3-mac-aarch64	WARNING	Mar 19 2025

Exports: adapter_content calc_adapter_content calc_format_score calc_over_rep_seq dimensions find_format GC_content gc_per_read kmer_count overrep_kmer overrep_reads per_base_quality per_read_quality plot_adapter_content plot_GC_content plot_overrep_kmer plot_overrep_reads plot_per_base_quality plot_per_read_quality plot_read_content plot_read_length qual_score_per_read read_base_content read_content read_length run_all

Dependencies: cli colorspace data.table dplyr fansi farver generics ggplot2 glue gtable isoband labeling lattice lifecycle magrittr MASS Matrix mgcv munsell nlme pillar pkgconfig plyr R6 RColorBrewer Rcpp reshape2 rlang RSeqAn scales seqTools stringi stringr tibble tidyselect utf8 vctrs viridisLite withr zlibbioc

Introduction to qckitfastq

August Guang and Wenyue Xing

Rendered from vignette-qckitfastq.Rmd using knitr::rmarkdown on Mar 19 2025.

Last update: 2019-07-22
Started: 2018-05-01

Readme and manuals

Help Manual

Help page	Topics
Creates an abundance table, sorted from most to least frequent, of adapters found in more than 0.1% of the reads. If output_file is specified, the entire set of adapters and counts is saved. Only available for macOS/Linux due to a dependency on C++14.	adapter_content
Compute adapter content in reads. This function is only available for macOS/Linux.	calc_adapter_content
Calculate score based on Illumina format	calc_format_score
Calculate sequence counts for each unique sequence and create a table with unique sequences and corresponding counts	calc_over_rep_seq
Extract the number of columns and rows for a FASTQ file using seqTools.	dimensions
Gets the quality score encoding format from the FASTQ file. Return possibilities are Sanger(/Illumina1.8), Solexa(/Illumina1.0), Illumina1.3, and Illumina1.5. The detection is heuristic-based and may not be 100% accurate, since the encodings overlap, so it is best if you already know the format.	find_format
Calculates GC content percentage for each read in the dataset.	GC_content
Calculate GC nucleotide sequence content per read of the FASTQ gzipped file	gc_per_read
Return kmer count per sequence for the length of kmer desired	kmer_count
Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset. The expected proportion of a length k kmer assumes site independence and is computed as the sum of the count of each base pair in the kmer times the probability of observing that base pair in the data set, i.e. P(A) × count_in_kmer(A) + P(C) × count_in_kmer(C) + ... The observed to expected ratio is computed as log2(obs/exp). Those with obsexp_ratio > 2 are considered to be overrepresented and appear in the returned data frame along with their position in the sequence.	overrep_kmer
Sort all sequences per read by count.	overrep_reads
Compute the mean, median, and percentiles of quality score per base. This is returned as a data frame.	per_base_quality
Compute the mean quality score per read.	per_read_quality
Creates a bar plot of the top 5 most present adapter sequences.	plot_adapter_content
Generate mean GC content histogram.	plot_GC_content
Determine how to plot outliers. The heuristic used is whether their obsexp_ratio differs by more than 1 and whether they fall into the same bin or not. If the obsexp_ratio of 2 outliers differs by less than 0.4 and they are in the same bin, they are combined into a single plotting point. NOT FULLY FUNCTIONAL	plot_outliers
Create a box plot of the log2(observed/expected) ratio across the length of the sequence as well as top overrepresented kmers. Only ratios greater than 2 are included in the box plot. Default is 20 bins across the length of the sequence and the top 2 overrepresented kmers, but this can be changed by the user.	plot_overrep_kmer
Plot the top 5 sequences	plot_overrep_reads
Generate a boxplot of the per position quality score.	plot_per_base_quality
Plot the mean quality score per sequence as a histogram. High quality sequences are those mostly distributed over 30. Low quality sequences are those mostly under 30.	plot_per_read_quality
Plot the per position nucleotide content.	plot_read_content
Plot a histogram of the number of reads with each read length.	plot_read_length
Calculate the mean quality score per read of the FASTQ gzipped file	qual_score_per_read
Compute nucleotide content per position for a single base pair. Wrapper function around seqTools.	read_base_content
Compute nucleotide content per position. Wrapper function around seqTools.	read_content
Creates a data frame of read lengths and the number of reads with that read length.	read_length
Will run all functions in the qckitfastq suite and save the data frames and plots to a user-provided directory. Plot names are supplied by default.	run_all
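
As a rough illustration of two per-read metrics described in the help table above (GC content percentage per read and mean quality score per read), the following base R sketch computes both on a few toy reads. It does not call qckitfastq and is not the package's implementation. The helper names gc_percent and mean_quality are hypothetical, the reads and quality strings are made up, and the quality decoding assumes an ASCII offset of 33 (the Sanger/Illumina1.8 encoding).

```r
# Hypothetical helpers in base R, for illustration only.
# They are not part of qckitfastq and do not reproduce its internals.

# Percentage of G and C bases in each read.
gc_percent <- function(reads) {
  bases <- strsplit(toupper(reads), "")
  vapply(bases,
         function(b) 100 * sum(b %in% c("G", "C")) / length(b),
         numeric(1))
}

# Mean Phred quality per read.
# Assumption: quality characters use an ASCII offset of 33.
mean_quality <- function(quals, offset = 33L) {
  vapply(quals,
         function(q) mean(utf8ToInt(q) - offset),
         numeric(1),
         USE.NAMES = FALSE)
}

# Toy data (made up): three reads and their quality strings.
reads <- c("ACGTACGT", "GGGCCCAT", "ATATATAT")
quals <- c("IIIIIIII", "IIII####", "!!!!!!!!")

gc <- gc_percent(reads)
mq <- mean_quality(quals)

print(data.frame(read = reads, gc_percent = gc, mean_quality = mq))
```

With a different encoding (for example Illumina1.3, which uses an offset of 64), the offset argument would need to change, which is why knowing or detecting the format first matters for any quality-based metric.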