| Title: | A High-Throughput and Extensible Toolkit for Processing FASTQ Data |
|---|---|
| Description: | High-throughput extensible toolkit for processing FASTQ data. The goal of this package is to empower users to quickly build out small programmatic 'kernels' to define any FASTQ processing task they may need. Builds on Intel TBB’s flow graph to orchestrate concurrent I/O and data processing; throughput can be as fast as compression and disk speed allows. The package also ships with a suite of predefined kernels for common FASTQ tasks. |
| Authors: | Travers Ching [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-5577-3516>), Yann Collet [ctb, cph] (Author of the bundled zstd library), Facebook, Inc. [cph] (Copyright holder of the bundled zstd code), Reichardt Tino [ctb, cph] (Contributor/copyright holder of bundled zstd code), Skibinski Przemyslaw [ctb, cph] (Contributor/copyright holder of bundled zstd code), Mori Yuta [ctb, cph] (Contributor/copyright holder of bundled zstd code) |
| Maintainer: | Travers Ching <[email protected]> |
| License: | GPL-3 |
| Version: | 1.1.1 |
| Built: | 2026-06-02 06:28:52 UTC |
| Source: | https://github.com/bioc/fraq |
Calculate distances between query sequences and a target under a chosen boundary model using Levenshtein or Hamming distance.
fraq_align( query, target, max_distance = 2147483647L, ambiguity_base = "", boundary = "contains", distance_metric = "lv" )fraq_align( query, target, max_distance = 2147483647L, ambiguity_base = "", boundary = "contains", distance_metric = "lv" )
query |
Character vector or Biostrings |
target |
Character vector or Biostrings |
max_distance |
Integer maximum allowed distance; defaults to .Machine$integer.max. |
ambiguity_base |
Single character to treat as ambiguity when matching, or empty string "" to disable; must be length 0 or 1. |
boundary |
One of "contains", "global", or "starts". |
distance_metric |
One of "lv" (Levenshtein) or "hm" (Hamming). "hm" requires query and target to be the same length. |
A data frame with the Biostrings inputs as the first two columns followed by the alignment metadata.
fraq_align("ACGTNT", "ACGTAT", max_distance = 2L, ambiguity_base = "N", boundary = "contains", distance_metric = "lv") fraq_align(Biostrings::DNAString("ACGT"), Biostrings::DNAString("ACGA"), max_distance = 1L, boundary = "global", distance_metric = "hm")fraq_align("ACGTNT", "ACGTAT", max_distance = 2L, ambiguity_base = "N", boundary = "contains", distance_metric = "lv") fraq_align(Biostrings::DNAString("ACGT"), Biostrings::DNAString("ACGA"), max_distance = 1L, boundary = "global", distance_metric = "hm")
Split input datasets into sequential chunks. Each chunk is
written using output_prefix suffixed with _chunk{N} and the format
indicated by output_suffix.
fraq_chunk(input, output_prefix, output_suffix, chunk_size, nthreads = 1L)fraq_chunk(input, output_prefix, output_suffix, chunk_size, nthreads = 1L)
input |
Character vector of source files/keys. |
output_prefix |
Character vector of prefixes used when naming chunked
outputs. Must be the same length as |
output_suffix |
Character scalar describing the output format; use
|
chunk_size |
Numeric chunk size in reads; each output chunk contains up to this many records (per input stream). |
nthreads |
Integer number of worker threads. |
The suffix mapping follows:
"fastq" -> .fastq
"gz" -> .fastq.gz
"zst" -> .fastq.zst
"fraq" -> .fraq
"mem" -> .mem
Invisibly returns NULL after writing all chunked outputs.
r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 25, read_length = 75) fraq_chunk(r1, output_prefix = tempfile("chunked_R1"), output_suffix = "fastq", chunk_size = 10, nthreads = 1L)r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 25, read_length = 75) fraq_chunk(r1, output_prefix = tempfile("chunked_R1"), output_suffix = "fastq", chunk_size = 10, nthreads = 1L)
Concatenate one or more FRAQ/FASTQ inputs (plain,
.gz, .zst, .fraq, or .mem) into a single output file.
fraq_concat(input, output, nthreads = 1L)fraq_concat(input, output, nthreads = 1L)
input |
Character vector of input paths/keys to concatenate. Mixed formats are supported. |
output |
Character scalar giving the destination path (or |
nthreads |
Integer number of threads for reading, compression, and writing. |
Invisibly returns NULL after writing the concatenated output.
tmp_dir <- tempdir() inputs <- file.path(tmp_dir, sprintf("reads_%d.fastq", 1:2)) lapply(inputs, generate_random_fastq, n_reads = 10, read_length = 50) out <- file.path(tmp_dir, "all_reads.fastq.gz") fraq_concat(inputs, out, nthreads = 1L)tmp_dir <- tempdir() inputs <- file.path(tmp_dir, sprintf("reads_%d.fastq", 1:2)) lapply(inputs, generate_random_fastq, n_reads = 10, read_length = 50) out <- file.path(tmp_dir, "all_reads.fastq.gz") fraq_concat(inputs, out, nthreads = 1L)
Re-encode sequencing files in any supported FRAQ/FASTQ format. Input and output vectors must be the same length.
fraq_convert(input, output, nthreads = 1L)fraq_convert(input, output, nthreads = 1L)
input |
Character vector of source files/keys. |
output |
Character vector of destination files/keys, same length
as |
nthreads |
Integer number of threads for reading/writing. |
FIFO pipes (paths ending with .fifo) are only available on Unix-like
systems;
on Windows they are not supported and will trigger an error.
Invisibly returns NULL after writing the converted outputs.
src <- tempfile(fileext = ".fastq") generate_random_fastq(src, n_reads = 10, read_length = 50) dest <- tempfile(fileext = ".fastq.gz") fraq_convert(src, dest, nthreads = 1L)src <- tempfile(fileext = ".fastq") generate_random_fastq(src, n_reads = 10, read_length = 50) dest <- tempfile(fileext = ".fastq.gz") fraq_convert(src, dest, nthreads = 1L)
Count occurrences of provided short barcodes in reads,
allowing up to max_distance mismatches. Accepts one or more FASTQ files
(e.g., R1/R2).
fraq_count_barcodes( input, barcodes, max_distance = 1L, allow_revcomp = FALSE, nthreads = 1L )fraq_count_barcodes( input, barcodes, max_distance = 1L, allow_revcomp = FALSE, nthreads = 1L )
input |
Character vector of one or more input FASTQ file paths (e.g., R1 and R2). |
barcodes |
Character vector of barcode sequences to count. |
max_distance |
Integer maximum number of mismatches allowed for a match. |
allow_revcomp |
Logical; if TRUE, also match reverse complements. |
nthreads |
Integer number of threads. |
A data frame with counts per barcode.
r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") short_barcodes <- c("ACGT", "TGCA", "GTAC") counts <- fraq_count_barcodes(c(r1, r2), short_barcodes, max_distance = 1L, allow_revcomp = FALSE, nthreads = 1L) countsr1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") short_barcodes <- c("ACGT", "TGCA", "GTAC") counts <- fraq_count_barcodes(c(r1, r2), short_barcodes, max_distance = 1L, allow_revcomp = FALSE, nthreads = 1L) counts
Write barcode-specific outputs by matching a prefix on the
first read of each record. Each output path is formed by substituting
{barcode} in the supplied format string.
fraq_demux(input, output_format, barcodes, max_distance = 1L, nthreads = 1L)fraq_demux(input, output_format, barcodes, max_distance = 1L, nthreads = 1L)
input |
Character vector of one or more input FASTQ file paths (e.g., R1 and R2). |
output_format |
Character vector of format strings, same length
as |
barcodes |
Character vector of barcode sequences to test as prefixes. |
max_distance |
Integer maximum Hamming distance allowed between barcode and read prefix. |
nthreads |
Integer number of threads. |
When no barcode matches, the literal NO_MATCH is substituted in place of
{barcode}. If multiple barcodes match the same read, MULTI_MATCH is used.
Invisibly, NULL. Files are written to disk according to
output_format.
r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") out <- tempfile("R1_", fileext = "_{barcode}.fastq") barcodes <- c("ACGT", "TGCA", "GTAC") fraq_demux(r1, out, barcodes, max_distance = 1L, nthreads = 1L)r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") out <- tempfile("R1_", fileext = "_{barcode}.fastq") barcodes <- c("ACGT", "TGCA", "GTAC") fraq_demux(r1, out, barcodes, max_distance = 1L, nthreads = 1L)
Write deterministically downsampled FASTQ file(s) to disk.
input and output must be vectors of the same length (e.g., R1/R2 pairs).
fraq_downsample(input, output, amount, nthreads = 1L)fraq_downsample(input, output, amount, nthreads = 1L)
input |
Character vector of one or more input FASTQ file paths.
Vectors must be the same length as |
output |
Character vector of output FASTQ file paths, same length
as |
amount |
Numeric scalar in (0, 1); proportion of reads to retain. |
nthreads |
Integer number of threads. |
Downsampling is deterministic: given the same inputs, fraq_downsample()
keeps the same records every run while matching the requested proportion
as closely as possible.
Invisibly returns NULL after writing the downsampled outputs.
r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") out <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) fraq_downsample(c(r1, r2), out, amount = 0.1)r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") out <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) fraq_downsample(c(r1, r2), out, amount = 0.1)
Report whether the current build of fraq was compiled with
named pipe (FIFO) support. FIFO outputs (paths ending in .fifo) are only
available on Unix-like platforms where the build detected S_IFIFO.
fraq_fifo_supported()fraq_fifo_supported()
The result is determined at compile time; reinstalling the package on an operating system that exposes FIFOs is required to enable support.
Logical scalar indicating whether FIFO inputs/outputs are supported.
if (fraq_fifo_supported()) { message("FIFO streams are available on this platform.") } else { message("Use regular files instead of .fifo paths on this build.") }if (fraq_fifo_supported()) { message("FIFO streams are available on this platform.") } else { message("Use regular files instead of .fifo paths on this build.") }
fraq_mem_list() summarizes the .mem datasets currently stored in the
session. fraq_mem_remove() deletes one or more .mem entries, freeing the
associated memory. fraq_mem_load() is a convenience wrapper around
fraq_convert() that loads on-disk FASTQ/FRAQ inputs into the in-memory
store after validating that the outputs end with .mem.
The in-memory store lives in the current R session. For consistent results,
call the helper functions when no other fraq jobs are actively writing to
the same .mem keys.
fraq_mem_list() fraq_mem_remove(mem_key) fraq_mem_load(input, mem_key, nthreads = 1L)fraq_mem_list() fraq_mem_remove(mem_key) fraq_mem_load(input, mem_key, nthreads = 1L)
mem_key |
Character vector of |
input |
Character vector of FASTQ/FRAQ paths to load into memory. |
nthreads |
Positive integer parallelism for the load. |
fraq_mem_list() returns a data frame with columns mem_key and
n_reads.
fraq_mem_remove() returns a logical vector indicating which keys were
removed.
fraq_mem_load() returns the target .mem keys invisibly, exactly as
supplied.
tmp <- tempfile(fileext = ".fastq") generate_random_fastq(tmp, n_reads = 100, read_length = 75) mem_path <- tempfile(fileext = ".mem") fraq_mem_load(tmp, mem_path) fraq_mem_list() fraq_mem_remove(mem_path)tmp <- tempfile(fileext = ".fastq") generate_random_fastq(tmp, n_reads = 100, read_length = 75) mem_path <- tempfile(fileext = ".mem") fraq_mem_load(tmp, mem_path) fraq_mem_list() fraq_mem_remove(mem_path)
Merge R1/R2 pairs by overlapping sequences (optionally reverse-complementing R2), emitting merged reads and optional unmerged outputs.
fraq_merge_pairs( input, output_merged, output_unmerged = NULL, min_overlap = 12L, max_mismatch_rate = 0.1, consensus_mode = c("max", "mean", "r1", "r2"), trim_overhang = TRUE, revcomp_R2 = TRUE, nthreads = 1L )fraq_merge_pairs( input, output_merged, output_unmerged = NULL, min_overlap = 12L, max_mismatch_rate = 0.1, consensus_mode = c("max", "mean", "r1", "r2"), trim_overhang = TRUE, revcomp_R2 = TRUE, nthreads = 1L )
input |
Character vector of length 2 with input FASTQ paths (R1, R2). |
output_merged |
Character scalar path/key receiving merged single-end reads. |
output_unmerged |
Optional character vector of length 2 for unmerged
R1/R2 outputs. Use |
min_overlap |
Integer minimum overlap required to attempt merging. |
max_mismatch_rate |
Numeric maximum mismatch fraction allowed within the overlap. |
consensus_mode |
Character string controlling consensus base
selection: |
trim_overhang |
Logical; if |
revcomp_R2 |
Logical; if |
nthreads |
Integer number of worker threads. |
Qualities are interpreted as PHRED+33.
A list summarising merge statistics (merged_reads,
unmerged_reads, mean_insert_size, etc.).
r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 100, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 100, read_length = 100, name_prefix = "read_R2_") out_merged <- tempfile(fileext = ".fastq") fraq_merge_pairs(c(r1, r2), out_merged, output_unmerged = NULL, min_overlap = 20L, max_mismatch_rate = 0.05)r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 100, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 100, read_length = 100, name_prefix = "read_R2_") out_merged <- tempfile(fileext = ".fastq") fraq_merge_pairs(c(r1, r2), out_merged, output_unmerged = NULL, min_overlap = 20L, max_mismatch_rate = 0.05)
Get or set FRAQ options.
fraq_options(option, value = NULL)fraq_options(option, value = NULL)
option |
Character string name of the option. Valid options are: "blocksize", "fraq_compress_level", "zstd_compress_level", "gzip_compress_level". |
value |
Optional value to set the option to; if NULL, the current value is returned. |
The current option value (if input value is NULL) or
previous option value.
# Get current blocksize fraq_options("blocksize") # # Set blocksize to 16384 fraq_options("blocksize", 16384)# Get current blocksize fraq_options("blocksize") # # Set blocksize to 16384 fraq_options("blocksize", 16384)
Drop read sets when any mate fails the quality thresholds. Qualities are interpreted as PHRED+33.
fraq_quality_filter( input, output, min_mean_quality = 20, max_low_q_bases = .Machine$integer.max, low_q_threshold = 20L, nthreads = 1L )fraq_quality_filter( input, output, min_mean_quality = 20, max_low_q_bases = .Machine$integer.max, low_q_threshold = 20L, nthreads = 1L )
input |
Character vector of one or more input FASTQ file paths. Must
be the same length as |
output |
Character vector of output FASTQ paths. |
min_mean_quality |
Numeric minimum mean base quality required for each mate. |
max_low_q_bases |
Integer maximum number of bases below
|
low_q_threshold |
Integer PHRED cutoff used to count low-quality bases. |
nthreads |
Integer number of worker threads. |
Both thresholds are evaluated separately on every mate. If any mate fails, the entire read set is discarded.
Invisibly, NULL. Reads are written to output paths for records
that pass the filters.
r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") out <- tempfile(fileext = ".fastq") fraq_quality_filter(r1, out, min_mean_quality = 25, max_low_q_bases = 2L, low_q_threshold = 20L, nthreads = 1L)r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") out <- tempfile(fileext = ".fastq") fraq_quality_filter(r1, out, min_mean_quality = 25, max_low_q_bases = 2L, low_q_threshold = 20L, nthreads = 1L)
Writes a minimal Rcpp example file showing how to write custom kernels via
Rcpp.
fraq_rcpp_template(output_file)fraq_rcpp_template(output_file)
output_file |
Character path where the C++ source file will be written. |
NULL invisibly.
cpp <- tempfile(fileext = ".cpp") fraq_rcpp_template(cpp) # Rcpp::sourceCpp(cpp) # optionally compile the examplecpp <- tempfile(fileext = ".cpp") fraq_rcpp_template(cpp) # Rcpp::sourceCpp(cpp) # optionally compile the example
Run an R function over blocks of FASTQ records. With
nthreads > 1, fraq keeps file I/O and compression work on background TBB
threads; with nthreads = 1 or after fork, it uses a serial path.
fraq_run_r(input, kernel, limit = NULL, nthreads = 1L)fraq_run_r(input, kernel, limit = NULL, nthreads = 1L)
input |
Character vector of source files/keys. |
kernel |
Function called as |
limit |
Optional non-negative whole-number scalar limiting processing
to the first |
nthreads |
Integer number of threads. When |
The kernel must return NULL or a named list of data frames. Each list name
is an output path or .mem key, and each data frame must contain character
columns name, seq, and qual. Output paths are normalized before
writing; .mem keys are exact in-memory identifiers and are not normalized.
The R kernel runs only on the calling R thread. When nthreads > 1,
background threads are used for reading, joining, demultiplexing,
compression, and writing.
Blocks are delivered to the R kernel in increasing block-index order. Within
each call, index is an increasing vector of zero-based read indices.
Do not use parallel::mclapply() inside the kernel. It forks the R process
on Unix-like systems, and forking while fraq has active background threads can
deadlock or crash. Use vectorized R code inside the kernel; if serial mapping
is needed, use lapply() instead.
Invisibly returns NULL after all outputs are written.
input_paths <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) output_paths <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) generate_random_fastq(input_paths[1], n_reads = 10, read_length = 50) generate_random_fastq(input_paths[2], n_reads = 10, read_length = 50) even_read_kernel <- function(reads, index) { keep <- index %% 2 == 0 filtered_read1 <- reads$read1[keep, , drop = FALSE] filtered_read2 <- reads$read2[keep, , drop = FALSE] output <- list() output[[output_paths[1]]] <- filtered_read1 output[[output_paths[2]]] <- filtered_read2 output } fraq_run_r(input_paths, even_read_kernel, nthreads = 2L)input_paths <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) output_paths <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) generate_random_fastq(input_paths[1], n_reads = 10, read_length = 50) generate_random_fastq(input_paths[2], n_reads = 10, read_length = 50) even_read_kernel <- function(reads, index) { keep <- index %% 2 == 0 filtered_read1 <- reads$read1[keep, , drop = FALSE] filtered_read2 <- reads$read2[keep, , drop = FALSE] output <- list() output[[output_paths[1]]] <- filtered_read1 output[[output_paths[2]]] <- filtered_read2 output } fraq_run_r(input_paths, even_read_kernel, nthreads = 2L)
fraq_export_shortreadq() converts FRAQ/FASTQ inputs into in-memory
ShortReadQ objects via fraq_convert. fraq_import_shortreadq
converts ShortReadQ objects to any supported FRAQ format.
fraq_export_shortreadq(input, nthreads = 1L, tmpdir = tempdir()) fraq_import_shortreadq(shortreadq, output, nthreads = 1L, tmpdir = tempdir())fraq_export_shortreadq(input, nthreads = 1L, tmpdir = tempdir()) fraq_import_shortreadq(shortreadq, output, nthreads = 1L, tmpdir = tempdir())
input |
Character vector of input paths/keys accepted by
|
nthreads |
Positive integer passed to |
tmpdir |
Directory used for staging temporary files. |
shortreadq |
A |
output |
Character vector of destination paths/keys, same length as
|
fraq_export_shortreadq() returns a single ShortReadQ when input has
length 1, otherwise a list of ShortReadQ objects.
fraq_import_shortreadq() invisibly returns output after conversion.
Filesystem paths are normalized; .mem keys are returned exactly as
supplied.
ShortRead::readFastq(), ShortRead::writeFastq(),
fraq_convert()
if (requireNamespace("ShortRead", quietly = TRUE)) { fq <- tempfile(fileext = ".fastq") generate_random_fastq(fq, n_reads = 10, read_length = 50) fraq_path <- tempfile(fileext = ".fraq") fraq_convert(fq, fraq_path) reads <- fraq_export_shortreadq(fraq_path) roundtrip_fastq <- tempfile(fileext = ".fastq") fraq_import_shortreadq(reads, roundtrip_fastq) stopifnot(file.exists(roundtrip_fastq)) }if (requireNamespace("ShortRead", quietly = TRUE)) { fq <- tempfile(fileext = ".fastq") generate_random_fastq(fq, n_reads = 10, read_length = 50) fraq_path <- tempfile(fileext = ".fraq") fraq_convert(fq, fraq_path) reads <- fraq_export_shortreadq(fraq_path) roundtrip_fastq <- tempfile(fileext = ".fastq") fraq_import_shortreadq(reads, roundtrip_fastq) stopifnot(file.exists(roundtrip_fastq)) }
Write a subset of reads from input to output, either the
first limit reads or specific zero-based indices in select.
fraq_slice(input, output, limit = NULL, select = NULL, nthreads = 1L)fraq_slice(input, output, limit = NULL, select = NULL, nthreads = 1L)
input |
Character vector of source files/keys. |
output |
Character vector of destination files/keys, same length as
|
limit |
Numeric scalar; keep the first |
select |
Numeric vector of zero-based indices to keep. Defaults to
|
nthreads |
Integer number of threads for reading/writing. |
Exactly one of limit or select must be supplied.
Invisibly returns NULL after writing the selected reads.
src <- tempfile(fileext = ".fastq") generate_random_fastq(src, n_reads = 10, read_length = 50) dest <- tempfile(fileext = ".fastq") fraq_slice(src, dest, limit = 5)src <- tempfile(fileext = ".fastq") generate_random_fastq(src, n_reads = 10, read_length = 50) dest <- tempfile(fileext = ".fastq") fraq_slice(src, dest, limit = 5)
Compute QC summaries for single- or paired-end FASTQ files. When two inputs are provided, R1 and R2 are summarized separately and an insert-size histogram is reported (estimated from R1 vs reverse-complemented R2 overlap).
fraq_summary( input, phred33 = TRUE, min_overlap = 12L, max_mismatch_rate = 0.1, limit = 0L, nthreads = 1L )fraq_summary( input, phred33 = TRUE, min_overlap = 12L, max_mismatch_rate = 0.1, limit = 0L, nthreads = 1L )
input |
Character vector of length 1 or 2 with input FASTQ paths. Length 1 = single-end; length 2 = paired-end (first element maps to R1, second to R2). |
phred33 |
Logical; TRUE if qualities are PHRED+33, FALSE for PHRED+64. |
min_overlap |
Integer minimum overlap used for insert-size estimation (paired-end only). |
max_mismatch_rate |
Numeric between 0 and 1 (inclusive); maximum allowed mismatch rate within the overlapped region (paired-end only). |
limit |
Numeric cap on the number of read sets to process. Use |
nthreads |
Integer number of threads. |
Outputs per-mate tables:
basic_stats_R{1,2}: total sequences, total bases, min/mean/max
length, GC percent.
per_base_quality_R{1,2}: mean PHRED by 1-based position (with
counts).
per_base_content_R{1,2}: long format base usage by position
(A/C/G/T/N/other).
length_distribution_R{1,2}: histogram of sequence lengths.
avg_read_quality_R{1,2}: histogram of rounded per-read average
quality (columns avg_quality, count).
For paired-end inputs, insert_size is included when overlaps are
found.
A named list of data frames. For single-end: R1-only tables. For
paired-end: R1/R2 tables plus optional insert_size. Each mate includes
basic_stats, per-base quality/content, length distributions, and
average read quality histograms.
# Single-end example r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") res_se <- fraq_summary(r1, nthreads = 1L) res_se$basic_stats_R1 # Paired-end example r1 <- tempfile(fileext = ".fastq"); r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") res_pe <- fraq_summary(c(r1, r2), nthreads = 1L) res_pe$basic_stats_R1 # dplyr example using pipes # library(dplyr) # res_pe$insert_size %>% arrange(desc(count)) %>% head()# Single-end example r1 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") res_se <- fraq_summary(r1, nthreads = 1L) res_se$basic_stats_R1 # Paired-end example r1 <- tempfile(fileext = ".fastq"); r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") res_pe <- fraq_summary(c(r1, r2), nthreads = 1L) res_pe$basic_stats_R1 # dplyr example using pipes # library(dplyr) # res_pe$insert_size %>% arrange(desc(count)) %>% head()
Trim occurrences of adapter sequence(s) at the start of the
first fastq input. input and output must be vectors of the same
length (e.g., R1/R2 pairs).
Adapters will be trimmed only for the first fastq, but all inputs will be
filtered if filter_untrimmed is TRUE.
fraq_trim_adapters( input, output, adapters, max_distance = 1L, filter_untrimmed = TRUE, nthreads = 1L )fraq_trim_adapters( input, output, adapters, max_distance = 1L, filter_untrimmed = TRUE, nthreads = 1L )
input |
Character vector of one or more input FASTQ file paths.
Vectors must be the same length as |
output |
Character vector of output FASTQ file paths, same length as
|
adapters |
Character vector of adapter sequences to trim. Adapters are given priority based on the order they appear. |
max_distance |
Integer maximum number of mismatches for adapter matching. |
filter_untrimmed |
Logical; if TRUE, drop reads with no trim. |
nthreads |
Integer number of threads. |
A data frame of counts of trimmed adapters.
r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") out <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) adapters <- c("ACGT", "TGCA", "GTAC") fraq_trim_adapters(c(r1, r2), out, adapters, max_distance = 1L, filter_untrimmed = TRUE, nthreads = 1L)r1 <- tempfile(fileext = ".fastq") r2 <- tempfile(fileext = ".fastq") generate_random_fastq(r1, n_reads = 1000, read_length = 100, name_prefix = "read_R1_") generate_random_fastq(r2, n_reads = 1000, read_length = 100, name_prefix = "read_R2_") out <- c(tempfile(fileext = ".fastq"), tempfile(fileext = ".fastq")) adapters <- c("ACGT", "TGCA", "GTAC") fraq_trim_adapters(c(r1, r2), out, adapters, max_distance = 1L, filter_untrimmed = TRUE, nthreads = 1L)
Creates a synthetic FASTQ file with random DNA sequences and Illumina-like
Phred+33 quality strings (high early-cycle quality with a gentle tail drop).
If output_file ends with .gz, the file is written
gzip-compressed via a connection.
generate_random_fastq( output_file, n_reads = 1e+05, read_length = 100, name_prefix = "read_" )generate_random_fastq( output_file, n_reads = 1e+05, read_length = 100, name_prefix = "read_" )
output_file |
Character vector of length 1 (single-end) or 2 (paired-end)
outputs. |
n_reads |
Integer number of reads to generate. Default |
read_length |
Integer read length (number of bases per read). Default |
name_prefix |
Character prefix for read names. Default |
Each read comprises four lines: header, sequence, +, and quality.
Headers are generated as @<name_prefix><index>. Sequences are sampled uniformly
from ACGT. Qualities follow a tapered profile that starts near Q37 and falls
toward the low 30s, with occasional low-quality spikes to mimic typical
Illumina output.
Invisibly returns the path(s) written in output_file.
# Example: small test files tmp_fastq <- tempfile(fileext = ".fastq") tmp_fastq_gz <- tempfile(fileext = ".fastq.gz") # Create plain FASTQ (500 reads, length 100) generate_random_fastq(tmp_fastq, n_reads = 500, read_length = 100) # Create gzipped FASTQ (500 reads, length 100) generate_random_fastq(tmp_fastq_gz, n_reads = 500, read_length = 100) # Paired-end example with overlapping mates generate_random_fastq(c(tmp_fastq, tmp_fastq_gz), n_reads = 100, read_length = 150)# Example: small test files tmp_fastq <- tempfile(fileext = ".fastq") tmp_fastq_gz <- tempfile(fileext = ".fastq.gz") # Create plain FASTQ (500 reads, length 100) generate_random_fastq(tmp_fastq, n_reads = 500, read_length = 100) # Create gzipped FASTQ (500 reads, length 100) generate_random_fastq(tmp_fastq_gz, n_reads = 500, read_length = 100) # Paired-end example with overlapping mates generate_random_fastq(c(tmp_fastq, tmp_fastq_gz), n_reads = 100, read_length = 150)