An Ultra-Fast All-in-One FASTQ preprocessor

Introduction

The Rfastp package provides an R interface to fastp, an all-in-one preprocessing toolkit for FASTQ files (Chen et al. 2018).

Installation

Use the BiocManager package to download and install the package from Bioconductor as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rfastp")

If required, the latest development version of the package can also be installed from GitHub.

BiocManager::install("remotes")
BiocManager::install("RockefellerUniversity/Rfastp")

Once the package is installed, load it into your R session:

library(Rfastp)

FastQ Quality Control with rfastp

The package contains three example FASTQ files: one single-end file and a pair of paired-end files.

se_read1 <- system.file("extdata","Fox3_Std_small.fq.gz",package="Rfastp")
pe_read1 <- system.file("extdata","reads1.fastq.gz",package="Rfastp")
pe_read2 <- system.file("extdata","reads2.fastq.gz",package="Rfastp")
outputPrefix <- tempfile(tmpdir = tempdir())

A normal QC run for a single-end FASTQ file.

rfastp supports multiple threads; set the number of threads with the thread parameter.

se_json_report <- rfastp(read1 = se_read1, 
    outputFastq = paste0(outputPrefix, "_se"), thread = 4)
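
When the machine running the analysis is not known in advance, the thread count can be derived from the available cores rather than hard-coded. The following is a minimal sketch using the base parallel package; the cap of 4 threads and the choice to leave one core free are illustrative assumptions, not Rfastp recommendations.

```r
library(parallel)

# Number of logical cores; detectCores() may return NA on some platforms.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free and never request more than 4 threads
# (both limits are arbitrary choices made for this sketch).
n_threads <- max(1L, min(4L, n_cores - 1L))
n_threads
```

The resulting value could then be supplied as thread = n_threads in the rfastp call above.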

A normal QC run for paired-end FASTQ files.

pe_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2,
    outputFastq = paste0(outputPrefix, "_pe"))
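
Paired-end processing expects read1 and read2 to contain the same number of records, so a quick sanity check before the run can be useful. The sketch below uses only base R and assumes standard four-line FASTQ records; count_fastq_reads and write_toy_fastq are hypothetical helpers written for this illustration, and the toy files stand in for real data.

```r
# Count the records in a (possibly gzipped) FASTQ file, assuming 4 lines per record.
count_fastq_reads <- function(path) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  length(readLines(con)) %/% 4L
}

# Write a tiny gzipped FASTQ file with n reads so the sketch runs without external data.
write_toy_fastq <- function(path, n) {
  records <- unlist(lapply(seq_len(n), function(i) {
    c(sprintf("@read%d", i), "ACGTACGT", "+", "IIIIIIII")
  }))
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  writeLines(records, con)
}

toy_r1 <- tempfile(fileext = "_R1.fastq.gz")
toy_r2 <- tempfile(fileext = "_R2.fastq.gz")
write_toy_fastq(toy_r1, 3)
write_toy_fastq(toy_r2, 3)

n_r1 <- count_fastq_reads(toy_r1)
n_r2 <- count_fastq_reads(toy_r2)

# Stop early if the two mates do not contain the same number of records.
stopifnot(n_r1 == n_r2)
```

Reading a whole file with readLines is only reasonable for small inputs; for large files a streaming counter would be preferable.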

Merge paired-end FASTQ files after QC.

pe_merge_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, merge = TRUE,
    outputFastq = paste0(outputPrefix, '_unpaired'),
    mergeOut = paste0(outputPrefix, "_merged.fastq.gz"))

UMI processing

A normal UMI processing run for a 10X single-cell library.

umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, 
    outputFastq = paste0(outputPrefix, '_umi1'), umi = TRUE, umiLoc = "read1",
    umiLength = 16)

Set a customized UMI prefix and location in the sequence name.

The following example adds a prefix string before the UMI sequence in the sequence name. By default, an “_” is added between the prefix string and the UMI sequence, and the UMI is inserted into the sequence name before the first space; the umiNoConnection and umiIgnoreSeqNameSpace arguments used below control these two behaviors.

umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, 
    outputFastq = paste0(outputPrefix, '_umi2'), umi = TRUE, umiLoc = "read1",
    umiLength = 16, umiPrefix = "#", umiNoConnection = TRUE, 
    umiIgnoreSeqNameSpace = TRUE)
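
To make the naming rules described above concrete, here is a conceptual sketch in base R. It is not Rfastp's implementation: tag_read_name is a hypothetical helper, the ":" separator before the tag is an assumption based on fastp's usual output, and treating "ignore the space" as "append the tag at the end of the name" is also an assumption.

```r
# Build the read name after UMI tagging (conceptual only; not Rfastp's code).
tag_read_name <- function(name, umi, prefix = "", no_connection = FALSE,
                          ignore_space = FALSE) {
  # Join prefix and UMI with "_" unless there is no prefix or the
  # connection character has been disabled.
  tag <- if (nzchar(prefix) && !no_connection) {
    paste0(prefix, "_", umi)
  } else {
    paste0(prefix, umi)
  }
  space_pos <- as.integer(regexpr(" ", name, fixed = TRUE))
  if (ignore_space || space_pos < 0) {
    # Assumption: treat the whole name as one token and append the tag.
    return(paste0(name, ":", tag))
  }
  # Insert the tag before the first space, keeping the rest of the name intact.
  paste0(substr(name, 1, space_pos - 1), ":", tag,
         substr(name, space_pos, nchar(name)))
}

read_name <- "@SEQ_ID:1101 1:N:0:ATCACG"
sequence  <- "AATTCCGGAATTCCGGTTTTGGGGCCCC"
umi       <- substr(sequence, 1, 16)               # first 16 bases, as with umiLength = 16
trimmed   <- substr(sequence, 17, nchar(sequence)) # read after the UMI is removed

default_name <- tag_read_name(read_name, umi, prefix = "UMI")
custom_name  <- tag_read_name(read_name, umi, prefix = "#",
                              no_connection = TRUE, ignore_space = TRUE)
```

Under these assumptions, default_name carries the tag UMI_AATTCCGGAATTCCGG before the first space, while custom_name ends in #AATTCCGGAATTCCGG with no connecting underscore.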

A QC example with customized cutoffs and adapter sequence.

Trim low-quality bases from the 3’ end base by base until a base with quality of at least 5 is reached; trim low-quality bases from the 5’ end with a 29 bp sliding window until the window's mean quality is at least 20; discard reads shorter than 29 bp; disable polyG trimming; and specify the adapter sequence for read1.

clipr_json_report <- rfastp(read1 = se_read1, 
    outputFastq = paste0(outputPrefix, '_clipr'),
    disableTrimPolyG = TRUE,
    cutLowQualFront = TRUE,
    cutFrontWindowSize = 29,
    cutFrontMeanQual = 20,
    cutLowQualTail = TRUE,
    cutTailWindowSize = 1,
    cutTailMeanQual = 5,
    minReadLength = 29,
    adapterSequenceRead1 = 'GTGTCAGTCACTTCCAGCGG'
)
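
The sliding-window idea behind the front and tail cutoffs can be illustrated in a few lines of base R. This is a simplified sketch of the concept, not fastp's actual algorithm: phred_scores and first_kept_base are hypothetical helpers, and the quality string is made up (Phred+33 encoding is assumed).

```r
# Decode a Phred+33 quality string into integer scores.
phred_scores <- function(qual_string) utf8ToInt(qual_string) - 33L

# Return the index of the first base kept when trimming from the 5' end:
# slide a window one base at a time and stop at the first window whose
# mean quality reaches the cutoff. Returns NA if no window qualifies.
first_kept_base <- function(scores, window_size, mean_qual) {
  last_start <- length(scores) - window_size + 1L
  if (last_start < 1L) return(NA_integer_)
  for (start in seq_len(last_start)) {
    window <- scores[start:(start + window_size - 1L)]
    if (mean(window) >= mean_qual) return(start)
  }
  NA_integer_
}

qual_string <- "###IIIII"   # '#' encodes Q2, 'I' encodes Q40
scores <- phred_scores(qual_string)

# A 3-base window with a mean-quality cutoff of 20.
keep_from_window <- first_kept_base(scores, window_size = 3L, mean_qual = 20)

# A window of size 1 is equivalent to base-by-base trimming.
keep_from_per_base <- first_kept_base(scores, window_size = 1L, mean_qual = 5)
```

In this toy read the 3-base window keeps the read from position 3 onward (one low-quality base survives because the window mean is already above the cutoff), whereas base-by-base trimming keeps it from position 4. Trimming from the 3’ end can be sketched the same way by applying the function to rev(scores).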

Multiple input files for read1/2 in a vector.

rfastp accepts multiple input files; it concatenates them into a single file and then runs fastp.

pe001_read1 <- system.file("extdata","splited_001_R1.fastq.gz",
    package="Rfastp")
pe002_read1 <- system.file("extdata","splited_002_R1.fastq.gz",
    package="Rfastp")
pe003_read1 <- system.file("extdata","splited_003_R1.fastq.gz",
    package="Rfastp")
pe004_read1 <- system.file("extdata","splited_004_R1.fastq.gz",
    package="Rfastp")
inputfiles <- c(pe001_read1, pe002_read1, pe003_read1, pe004_read1)
cat_rjson_report <- rfastp(read1 = inputfiles, 
    outputFastq = paste0(outputPrefix, "_merged1"))

Concatenate multiple FASTQ files.

catfastq concatenates all the input files into a new file.

pe001_read2 <- system.file("extdata","splited_001_R2.fastq.gz",
    package="Rfastp")
pe002_read2 <- system.file("extdata","splited_002_R2.fastq.gz",
    package="Rfastp")
pe003_read2 <- system.file("extdata","splited_003_R2.fastq.gz",
    package="Rfastp")
pe004_read2 <- system.file("extdata","splited_004_R2.fastq.gz",
    package="Rfastp")
inputR2files <- c(pe001_read2, pe002_read2, pe003_read2, pe004_read2)
catfastq(output = paste0(outputPrefix,"_merged2_R2.fastq.gz"), 
    inputFiles = inputR2files)

Generate report tables/plots

A data frame for the summary.

dfsummary <- qcSummary(pe_json_report)
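
Individual values can also be derived from a report object directly. The sketch below assumes the object returned by rfastp is a nested list that mirrors fastp's JSON report, with summary$before_filtering and summary$after_filtering entries; mock_report and its numbers are invented for illustration, and retained is a hypothetical helper.

```r
# Mock object mimicking the part of the report used here (values are made up).
mock_report <- list(
  summary = list(
    before_filtering = list(total_reads = 1000, total_bases = 50000),
    after_filtering  = list(total_reads = 950,  total_bases = 46000)
  )
)

# Fraction of a given quantity that survives filtering.
retained <- function(report, field) {
  report$summary$after_filtering[[field]] /
    report$summary$before_filtering[[field]]
}

read_fraction <- retained(mock_report, "total_reads")
base_fraction <- retained(mock_report, "total_bases")
```

If the assumption about the report structure holds, the same helper could be applied to pe_json_report in place of the mock object.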

A ggplot2 object of the base quality plot.

p1 <- curvePlot(se_json_report)
p1

A ggplot2 object of the GC content plot.

p2 <- curvePlot(se_json_report, curve="content_curves")
p2

A data frame for the trimming summary.

dfTrim <- trimSummary(pe_json_report)

Miscellaneous helper functions

Usage of rfastp:

?rfastp

Usage of catfastq:

?catfastq

Usage of qcSummary:

?qcSummary

Usage of trimSummary:

?trimSummary

Usage of curvePlot:

?curvePlot

Acknowledgments

Thank you to Ji-Dung Luo for testing, vignette review, and critical feedback; Doug Barrows for critical feedback and vignette review; and Ziwei Liang for their support.

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rfastp_1.15.0    BiocStyle_2.33.1
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5        jsonlite_1.8.8      highr_0.11          rjson_0.2.22       
##  [5] compiler_4.4.1      BiocManager_1.30.25 Rcpp_1.0.13         stringr_1.5.1      
##  [9] jquerylib_0.1.4     scales_1.3.0        yaml_2.3.10         fastmap_1.2.0      
## [13] ggplot2_3.5.1       R6_2.5.1            plyr_1.8.9          labeling_0.4.3     
## [17] knitr_1.48          tibble_3.2.1        maketools_1.3.0     munsell_0.5.1      
## [21] bslib_0.8.0         pillar_1.9.0        rlang_1.1.4         utf8_1.2.4         
## [25] cachem_1.1.0        stringi_1.8.4       xfun_0.47           sass_0.4.9         
## [29] sys_3.4.2           cli_3.6.3           withr_3.0.1         magrittr_2.0.3     
## [33] digest_0.6.37       grid_4.4.1          lifecycle_1.0.4     vctrs_0.6.5        
## [37] evaluate_0.24.0     glue_1.7.0          farver_2.1.2        buildtools_1.0.0   
## [41] fansi_1.0.6         colorspace_2.1-1    reshape2_1.4.4      rmarkdown_2.28     
## [45] tools_4.4.1         pkgconfig_2.0.3     htmltools_0.5.8.1

References

Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “fastp: an ultra-fast all-in-one FASTQ preprocessor.” Bioinformatics 34 (17): i884–90. https://doi.org/10.1093/bioinformatics/bty560.