Package: qckitfastq 1.21.0

August Guang

qckitfastq: FASTQ Quality Control

Assessment of FASTQ files with multiple metrics, including quality score, sequence content, overrepresented sequences and kmers.

Authors: Wenyue Xing [aut], August Guang [aut, cre]

qckitfastq_1.21.0.tar.gz
qckitfastq_1.21.0.zip (r-4.5), qckitfastq_1.21.0.zip (r-4.4), qckitfastq_1.21.0.zip (r-4.3)
qckitfastq_1.21.0.tgz (r-4.4-arm64), qckitfastq_1.21.0.tgz (r-4.4-x86_64), qckitfastq_1.21.0.tgz (r-4.3-arm64), qckitfastq_1.21.0.tgz (r-4.3-x86_64)
qckitfastq_1.21.0.tar.gz (r-4.5-noble), qckitfastq_1.21.0.tar.gz (r-4.4-noble)
qckitfastq_1.21.0.tgz (r-4.4-emscripten), qckitfastq_1.21.0.tgz (r-4.3-emscripten)
qckitfastq.pdf | qckitfastq.html
qckitfastq/json (API)
NEWS

# Install 'qckitfastq' in R:
install.packages('qckitfastq', repos = c('https://bioc.r-universe.dev', 'https://cloud.r-project.org'))

Uses libs:
  • zlib – Compression library
  • c++ – GNU Standard C++ Library v3

On Bioconductor: qckitfastq-1.21.0 (bioc 3.20), qckitfastq-1.20.0 (bioc 3.19)

This package does not link to any GitHub/GitLab/R-Forge repository. No issue tracker or development information is available.

bioconductor-package

26 exports 0.71 score 40 dependencies

Last updated 2 months ago from: 1e671c4706

Exports: adapter_content, calc_adapter_content, calc_format_score, calc_over_rep_seq, dimensions, find_format, GC_content, gc_per_read, kmer_count, overrep_kmer, overrep_reads, per_base_quality, per_read_quality, plot_adapter_content, plot_GC_content, plot_overrep_kmer, plot_overrep_reads, plot_per_base_quality, plot_per_read_quality, plot_read_content, plot_read_length, qual_score_per_read, read_base_content, read_content, read_length, run_all

Dependencies: cli, colorspace, data.table, dplyr, fansi, farver, generics, ggplot2, glue, gtable, isoband, labeling, lattice, lifecycle, magrittr, MASS, Matrix, mgcv, munsell, nlme, pillar, pkgconfig, plyr, R6, RColorBrewer, Rcpp, reshape2, rlang, RSeqAn, scales, seqTools, stringi, stringr, tibble, tidyselect, utf8, vctrs, viridisLite, withr, zlibbioc

Introduction to qckitfastq

Rendered from vignette-qckitfastq.Rmd using knitr::rmarkdown on Jun 30 2024.

Last update: 2019-07-22
Started: 2018-05-01

Readme and manuals

Help Manual

Help page | Topics
Creates an abundance table, sorted from most to least frequent, of adapters present in more than 0.1% of the reads. If output_file is specified, the entire set of adapters and counts is saved. Only available for macOS/Linux due to a dependency on C++14. | adapter_content
Compute adapter content in reads. This function is only available for macOS/Linux. | calc_adapter_content
Calculate score based on Illumina format. | calc_format_score
Calculate sequence counts for each unique sequence and create a table of unique sequences and their corresponding counts. | calc_over_rep_seq
Extract the number of columns and rows for a FASTQ file using seqTools. | dimensions
Gets the quality score encoding format from the FASTQ file. Possible return values are Sanger(/Illumina1.8), Solexa(/Illumina1.0), Illumina1.3, and Illumina1.5. The detection is heuristic and may not be 100% accurate, since the encodings overlap, so it is best if you already know the format. | find_format
Calculates GC content percentage for each read in the dataset. | GC_content
Calculate GC nucleotide sequence content per read of the gzipped FASTQ file. | gc_per_read
Return the kmer count per sequence for the desired kmer length. | kmer_count
Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset. The expected proportion of a length k kmer assumes site independence and is computed as the sum of the count of each base pair in the kmer times the probability of observing that base pair in the data set, i.e. P(A) * count_in_kmer(A) + P(C) * count_in_kmer(C) + ... The observed to expected ratio is computed as log2(obs/exp). Those with obsexp_ratio > 2 are considered to be overrepresented and appear in the returned data frame along with their position in the sequence. | overrep_kmer
Sort all sequences per read by count. | overrep_reads
Compute the mean, median, and percentiles of quality score per base. This is returned as a data frame. | per_base_quality
Compute the mean quality score per read. | per_read_quality
Creates a bar plot of the top 5 most abundant adapter sequences. | plot_adapter_content
Generate mean GC content histogram. | plot_GC_content
Determine how to plot outliers. Heuristic used is whether their obsexp_ratio differs by more than 1 and whether they fall into the same bin or not. If for 2 outliers, obsexp_ratio differs by less than 0.4 and they are in the same bin, then combine into a single plotting point. NOT FULLY FUNCTIONAL. | plot_outliers
Create a box plot of the log2(observed/expected) ratio across the length of the sequence as well as top overrepresented kmers. Only ratios greater than 2 are included in the box plot. Default is 20 bins across the length of the sequence and the top 2 overrepresented kmers, but this can be changed by the user. | plot_overrep_kmer
Plot the top 5 sequences. | plot_overrep_reads
Generate a boxplot of the per position quality score. | plot_per_base_quality
Plot the mean quality score per sequence as a histogram. High quality sequences are those mostly distributed over 30. Low quality sequences are those mostly under 30. | plot_per_read_quality
Plot the per position nucleotide content. | plot_read_content
Plot a histogram of the number of reads with each read length. | plot_read_length
Calculate the mean quality score per read of the gzipped FASTQ file. | qual_score_per_read
Compute nucleotide content per position for a single base pair. Wrapper function around seqTools. | read_base_content
Compute nucleotide content per position. Wrapper function around seqTools. | read_content
Creates a data frame of read lengths and the number of reads with that read length. | read_length
Will run all functions in the qckitfastq suite and save the data frames and plots to a user-provided directory. Plot names are supplied by default. | run_all
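
Illustration for adapter_content: the help text describes an abundance table restricted to adapters found in more than 0.1% of reads and sorted from most to least frequent. The Python sketch below shows only that filter-and-sort step. It is not the package's R/C++ code; the function name adapter_table, its parameters, and the example counts are hypothetical, and it assumes each count is the number of reads containing that adapter.

```python
def adapter_table(adapter_counts, total_reads, min_fraction=0.001):
    """Keep adapters seen in more than min_fraction of reads,
    sorted from most to least frequent (hypothetical helper)."""
    kept = [
        (name, count)
        for name, count in adapter_counts.items()
        if count / total_reads > min_fraction
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up counts for 10,000 reads; adapterB falls below the 0.1% cutoff.
counts = {"adapterA": 50, "adapterB": 5, "adapterC": 300}
table = adapter_table(counts, total_reads=10_000)
```

The cutoff is kept as a parameter so the effect of the 0.1% threshold can be checked on small inputs.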
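
Illustration for find_format: the help text says the encoding is guessed heuristically because the encodings overlap. One common heuristic, sketched here in Python, looks at the lowest ASCII code seen in the quality strings. This is an assumption about how such a guess can be made, not the package's actual rule; the function name guess_encoding is hypothetical, and the cutoffs come from the commonly cited ASCII ranges (Sanger from 33, Solexa from 59, Illumina 1.3 from 64, Illumina 1.5 from 66).

```python
def guess_encoding(quality_strings):
    """Guess the quality encoding from the lowest ASCII code observed.

    Hypothetical helper: a range-based heuristic only. Files whose
    qualities are all high cannot be told apart this way.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return "Sanger"       # Phred+33 (also Illumina 1.8)
    if lowest < 64:
        return "Solexa"       # Solexa+64 (also Illumina 1.0)
    if lowest < 66:
        return "Illumina1.3"  # Phred+64
    return "Illumina1.5"      # Phred+64, lowest codes unused


encoding = guess_encoding(["IIII#", "IIIII"])
```

Because a file containing only high scores matches several ranges, the guess is most reliable when many reads are inspected, which is consistent with the advice to rely on a known format where possible.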
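
Illustration for GC_content, gc_per_read, per_read_quality and qual_score_per_read: these entries describe per-read GC percentage and per-read mean quality. A minimal Python sketch of both calculations follows. The function names gc_percent and mean_quality are hypothetical, and the default offset of 33 assumes Sanger-style (Phred+33) quality strings.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in one read (case-insensitive)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def mean_quality(quality_string, offset=33):
    """Mean quality score of one read, assuming a Phred+offset encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


# Two made-up reads as (sequence, quality string) pairs.
reads = [("GCGC", "IIII"), ("ATGC", "5555")]
gc_values = [gc_percent(seq) for seq, _ in reads]
quality_means = [mean_quality(qual) for _, qual in reads]
```

Under these assumptions, a read whose mean falls below 30 would land in the low-quality region described for plot_per_read_quality.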
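
Illustration for overrep_kmer: the help text defines an expected value from base probabilities, a log2(obs/exp) ratio, and a cutoff of 2. The Python sketch below follows that wording literally. It is not taken from the package source, which may compute these quantities differently; the function names are hypothetical, and observed and expected are assumed to be supplied in the same units.

```python
import math
from collections import Counter


def expected_from_description(kmer, base_prob):
    """Sum over bases of P(base) * count_in_kmer(base), as worded in the help text."""
    return sum(base_prob[base] * n for base, n in Counter(kmer).items())


def obsexp_ratio(observed, expected):
    """log2 of the observed to expected ratio."""
    return math.log2(observed / expected)


def is_overrepresented(observed, expected, threshold=2.0):
    """True when the log2 ratio is strictly greater than the threshold."""
    return obsexp_ratio(observed, expected) > threshold


# Uniform base probabilities, chosen only for the example.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
expected = expected_from_description("AAC", uniform)
ratio = obsexp_ratio(8, 1)
```

Note that the comparison is strict, so a kmer whose ratio is exactly 2 is not reported in this sketch.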
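
Illustration for per_base_quality: the help text describes the mean, median, and percentiles of quality score at each base position. The Python sketch below computes such a summary from quality strings. The function name per_base_summary is hypothetical, the Phred+33 offset is an assumption, and quartiles are used only as an example because the help text does not say which percentiles the package reports.

```python
import statistics


def per_base_summary(quality_strings, offset=33):
    """Summarise quality scores at each 1-based read position (hypothetical helper)."""
    columns = {}
    for quality in quality_strings:
        for position, ch in enumerate(quality, start=1):
            columns.setdefault(position, []).append(ord(ch) - offset)

    rows = []
    for position in sorted(columns):
        scores = columns[position]
        if len(scores) > 1:
            q25, _, q75 = statistics.quantiles(scores, n=4, method="inclusive")
        else:
            # A single value has no spread; reuse it for both quartiles.
            q25 = q75 = scores[0]
        rows.append({
            "position": position,
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "q25": q25,
            "q75": q75,
        })
    return rows


# Three made-up reads of length 3 ('I' is 40 and '5' is 20 under Phred+33).
rows = per_base_summary(["II5", "I5I", "III"])
```

Each row corresponds to one position, which is the shape a per-position box plot such as plot_per_base_quality would need.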