RNA-seq has become a state-of-art technology for studying gene expression (Z. Wang, Gerstein, and Snyder 2009). However, due to improper RNA preparation and choice of some library preparation protocols, such as rRNA-depletion based RNA-seq protocol (O’Neil, Glowatz, and Schlumpberger 2013) and the SMART-Seq protocol, RNA-seq data might suffer from genomic DNA (gDNA) contamination, which skews quantitation of gene expression and hinders differential gene expression analysis (Li et al. 2022; Verwilt et al. 2020; Markou et al. 2021). Some quality control tools have been developed to check the quality of RNA-seq data at the raw read and post-alignment levels, including FastQC, RSeQC (L. Wang, Wang, and Li 2012), Qualimap (García-Alcalde et al. 2012), RNA-SeQC/RNA-SeQC 2 (DeLuca et al. 2012; Graubert et al. 2021), QoRTs (Hartley and Mullikin 2015), RNA-QC-Chain (Zhou et al. 2018), and RNAseqQC. Those post-alignment tools can report percentages of reads mapping to different genomic features, such as genes, exons, introns, intergenic regions, and rRNA exons. Thus, they can be used to roughly detect gDNA contamination. So far, SeqMonk and the gDNAx package are the only tool which can be used to both detect and correct for gDNA contamination in RNA-seq data. However, SeqMonk is a Java-based GUI tool which makes it not available in most high performance computing cluster. More importantly, seqMonk assumes a uniform distribution of reads derived from gDNA contamination and its performance on correcting for gDNA contamination in RNA-seq data is not fully evaluated and peer reviewed. On the other hand, gDNAx is an R/Bioconductor package and can be incorporated into automated RNA-seq data analysis pipeline easily. However, gDNAx only simply removes reads mapping to intergenic regions and introns, but not those mapping to exons, to mitigate gDNA contamination. Thus, gDNAx’s algorithm for correcting for gDNA contamination is not sophisticated. To this end, we developed the R pacakge CleanUpRNAseq, which provides a full set of functions for detecting and correcting for gDNA contamination in RNA-seq data across all computing platforms.
As for any other sequencing data analysis, users should first check the quality of raw RNA-seq sequencing data using a tool like FastQC followed by MultiQC (Ewels et al. 2016). Depending on the quality of the raw data, trimming low quality bases and/or adpator sequences might be necessary to increase mapping quality and rate. Subsequently, reads are mapped to the reference genome of interest using tools, such as STAR (Dobin et al. 2012), HISAT2 (Kim et al. 2019), GMAP (Wu et al. 2016), and Subread (Liao, Smyth, and Shi 2013). The resulting alignment files in the BAM format are used for post-alignment quality control, including detection of gDNA contamination. Meanwhile, if the RNA-seq library is unstranded, reads are mapped to the reference transcriptome using Salmon (Patro et al. 2017) to get transcript-level abundance, counts, and effective lengths, since Salmon can resolve reads mapped to multiple transcripts, which are imported into R using the tximport to get gene-level counts, abundance and length information. However, if the library is stranded, reads are mapped to the reference transcriptome using Salmon twice: one using the designed strandedness for the ‘–libType’ option, the other using the opposite strandedness. See Salmon library type discussion https://salmon.readthedocs.io/en/latest/library_type.html. Then gene-level counts, abundance and length information are imported into R as above.
First, load the required packages, including CleanUpRNAseq. Then users can perform a step-by-step analysis of the RNA-seq data as below for more flexibility. For users’ convenience, CleanUpRNAseq also offers two wrapper functions, create_diagnostic_plots and correct_for_contamination, for detecting and correcting for gDNA contamination in RNA-seq data. As for how to use these wrappers, please refer to their function documentation.
suppressPackageStartupMessages({
library("CleanUpRNAseq")
#devtools::load_all("../../CleanUpRNAseq")
library("ggplotify")
library("patchwork")
library("ensembldb")
library("utils")
})
Users can list all current EnsDb packages from AnnotationHub and load the package of choice, if available, as follows. Here, an EnsDb package for the human genome is loaded. It is crucial to use an EnsDb pacakge which represents the genome annotation file (GTF) used for RNA-seq read alignment.
suppressPackageStartupMessages({
library("EnsDb.Hsapiens.v86")
})
hs_ensdb_sqlite <- EnsDb.Hsapiens.v86
Otherwise, users can easily prepare an EnsDb database from an Ensembl GTF file. For all following steps, this option is adopted because the latest version of human transcriptome has been used for read mapping by STAR and Salmon to quantify gene expression of the example RNA-seq data.
options(timeout = max(3000, getOption("timeout")))
tmp_dir <- tempdir()
gtf <- system.file("extdata", "example.gtf.gz",
package = "CleanUpRNAseq")
hs_ensdb_sqlite <-
ensDbFromGtf(
gtf = gtf,
outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"),
organism = "Homo_Sapiens",
genomeVersion = "GRCh38",
version = 110
)
Potential DNA contamination exists if a significantly high portion of RNA-seq reads mapped to intergenic regions. CleanUpRNAseq uses the featureCounts function from the Rsubread package to quantify the percentage of reads mapped to different genomic features: genes, exons, introns, intergenic regions, rRNA exons, and organelle genome(s). So users need to prepare SAF files for these genomic features. In addition, a BAM file is needed to provide the lengths of the chromosomes/scaffolds.
Reads mapped to different genomic features is summarized by using the featureCounts function with the SAF files generated above as annotation. CleanUpRNAseq also quantifies the reads spanning exon-exon junctions and the reads mapped to exons of each gene using the GTF file as annotation. The junction read count table is used to determine the unexpressed, multiexonic genes, while the gene-level read count table is used for comparing samples at the gene level. Here two downsampled BAM files are used for speeding up the demonstration, while a precomputed R object in the .RData format is used for actual downstream analysis.
tmp_dir <- tempdir()
in_dir <- system.file("extdata", package = "CleanUpRNAseq")
gtf.gz <- dir(in_dir, ".gtf.gz$", full.name = TRUE)
gtf <- file.path(tmp_dir, gsub("\\.gz", "", basename(gtf.gz)))
R.utils::gunzip(gtf.gz, destname= gtf,
overwrite = TRUE, remove = FALSE)
in_dir <- system.file("extdata", package = "CleanUpRNAseq")
BAM_file <- dir(in_dir, ".bam$", full.name = TRUE)
salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE)
sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file)
salmon_quant_file_opposite_strand <- salmon_quant_file
col_data <- data.frame(sample_name = sample_name,
BAM_file = BAM_file,
salmon_quant_file = salmon_quant_file,
salmon_quant_file_opposite_strand =
salmon_quant_file_opposite_strand,
group = c("CD1N", "CD1P"))
sc <- create_summarizedcounts(lib_strand = 0, colData = col_data)
## featurecounts
capture.output({counts_list <- summarize_reads(
SummarizedCounts = sc,
saf_list = saf_list,
gtf = gtf,
threads = 1,
verbose = TRUE
)}, file = tempfile())
## load salmon quant
salmon_counts <- salmon_tximport(
SummarizedCounts = sc,
ensdb_sqlite = hs_ensdb_sqlite
)
Precomputed R object in the .RData format
(feature_counts_list.rda, and salmon_quant.rda)
containing the featureCounts output and Salmon
quantification output imported by using the tximport
function for a RNA-seq dataset of 8 samples from two treatment groups
are used for demo. GC-content for genes and intergenic regions
(intergenic_GC.rda, and gene_GC.rda) were also
precomputed and used for demo.
data("feature_counts_list")
data("salmon_quant")
data("intergenic_GC")
data("gene_GC")
lib_strand <- 0
col_data_f <- system.file("extdata", "example.colData.txt",
package = "CleanUpRNAseq")
col_data <- read.delim(col_data_f, as.is = TRUE)
## create fake bam files
tmp_dir <- tempdir()
bamfiles <- gsub(".+/", "", col_data$BAM_file)
null <- lapply(file.path(tmp_dir, bamfiles), file.create)
## create fake quant.sf files
quant_sf <- file.path(tmp_dir, gsub(".srt.bam$",
"quant.sf",
bamfiles))
null <- lapply(quant_sf, file.create)
col_data$BAM_file <- file.path(tmp_dir, bamfiles)
col_data$salmon_quant_file <- quant_sf
## pretend this is stranded RA=NA-seq data
col_data$salmon_quant_file_opposite_strand <- quant_sf
sc <- create_summarizedcounts(lib_strand, col_data)
sc$set_feature_counts(feature_counts_list)
sc$set_salmon_quant(salmon_quant)
sc$set_salmon_quant_opposite(salmon_quant)
p_assignment_stat <-plot_assignment_stat(SummarizedCounts = sc)
wrap_plots(p_assignment_stat)
assigned_per_region <- get_region_stat(SummarizedCounts = sc)
p <- plot_read_distr(assigned_per_region)
p
p_expr_distr <- plot_expr_distr(
SummarizedCounts = sc,
normalization = "DESeq2"
)
wrap_plots(p_expr_distr, ncol = 1, nrow = 3)
## DESeq2 exploratory analysis before correction
p<- plot_pca_heatmap(SummarizedCounts = sc,
silent = TRUE)
p$pca + as.ggplot(p$heatmap)
If the libraries are unstranded, CleanUpRNAseq uses a median per-base read coverage (median_PBRC) over non-zero count intergenic regions to estimate per-base DNA contamination over exons of each gene of each sample, and corrects for gene-level DNA contamination by subtracting median_PBRC * average_gene_length from the Salmon count table of each gene of each sample.
Alternatively, for unstranded RNA-seq data, CleanUpRNAseq offers a correction method leveraging GC-content bias on PCR amplification.
gc_correction <-
correct_GC(
SummarizedCounts = sc,
gene_gc = gene_GC,
intergenic_gc = intergenic_GC,
plot = FALSE
)
However, if the libraries are stranded, CleanUpRNAseq
considered as gene-level contamination the count table resulted from
quantitation using the opposite strandedness setting. By subtracting the
DNA contamination of each gene in each sample from the count table
resulted from quantitation using the actual strandedness setting, users
can get contamination corrected count table. To this end, use the
correct_stranded
function.
After correcting for DNA contamination, we expect to see more comparable gene expression across samples.
Boxplots, density plots and empirical cumulative distribution after correction revealed gene expression across samples are more comparable.
Here is the output of sessionInfo()
on the system on
which this document was compiled running pandoc 3.2.1:
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ensembldb_2.31.0 AnnotationFilter_1.31.0 GenomicFeatures_1.59.1
## [4] AnnotationDbi_1.69.0 Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.1 IRanges_2.41.1 S4Vectors_0.45.2
## [10] BiocGenerics_0.53.3 generics_0.1.3 patchwork_1.3.0
## [13] ggplotify_0.1.2 CleanUpRNAseq_1.1.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] rstudioapi_0.17.1 jsonlite_1.8.9
## [5] tximport_1.35.0 magrittr_2.0.3
## [7] farver_2.1.2 rmarkdown_2.29
## [9] fs_1.6.5 BiocIO_1.17.0
## [11] zlibbioc_1.52.0 vctrs_0.6.5
## [13] memoise_2.0.1 Rsamtools_2.23.0
## [15] RCurl_1.98-1.16 base64enc_0.1-3
## [17] htmltools_0.5.8.1 S4Arrays_1.7.1
## [19] curl_6.0.1 gridGraphics_0.5-1
## [21] SparseArray_1.7.2 Formula_1.2-5
## [23] sass_0.4.9 KernSmooth_2.23-24
## [25] bslib_0.8.0 htmlwidgets_1.6.4
## [27] plyr_1.8.9 cachem_1.1.0
## [29] buildtools_1.0.0 GenomicAlignments_1.43.0
## [31] lifecycle_1.0.4 pkgconfig_2.0.3
## [33] Matrix_1.7-1 R6_2.5.1
## [35] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] MatrixGenerics_1.19.0 digest_0.6.37
## [39] colorspace_2.1-1 DESeq2_1.47.1
## [41] Hmisc_5.2-0 RSQLite_2.3.8
## [43] labeling_0.4.3 fansi_1.0.6
## [45] httr_1.4.7 abind_1.4-8
## [47] mgcv_1.9-1 compiler_4.4.2
## [49] withr_3.0.2 bit64_4.5.2
## [51] htmlTable_2.4.3 backports_1.5.0
## [53] BiocParallel_1.41.0 DBI_1.2.3
## [55] R.utils_2.12.3 DelayedArray_0.33.2
## [57] rjson_0.2.23 tools_4.4.2
## [59] foreign_0.8-87 nnet_7.3-19
## [61] R.oo_1.27.0 glue_1.8.0
## [63] restfulr_0.0.15 nlme_3.1-166
## [65] grid_4.4.2 checkmate_2.3.2
## [67] cluster_2.1.6 reshape2_1.4.4
## [69] sva_3.55.0 gtable_0.3.6
## [71] BSgenome_1.75.0 R.methodsS3_1.8.2
## [73] qsmooth_1.23.0 data.table_1.16.2
## [75] utf8_1.2.4 XVector_0.47.0
## [77] ggrepel_0.9.6 pillar_1.9.0
## [79] stringr_1.5.1 yulab.utils_0.1.8
## [81] limma_3.63.2 genefilter_1.89.0
## [83] splines_4.4.2 lattice_0.22-6
## [85] survival_3.7-0 rtracklayer_1.67.0
## [87] bit_4.5.0 annotate_1.85.0
## [89] locfit_1.5-9.10 maketools_1.3.1
## [91] Biostrings_2.75.1 knitr_1.49
## [93] gridExtra_2.3 ProtGenerics_1.39.0
## [95] edgeR_4.5.0 SummarizedExperiment_1.37.0
## [97] xfun_0.49 statmod_1.5.0
## [99] matrixStats_1.4.1 pheatmap_1.0.12
## [101] stringi_1.8.4 UCSC.utils_1.3.0
## [103] lazyeval_0.2.2 yaml_2.3.10
## [105] evaluate_1.0.1 codetools_0.2-20
## [107] tibble_3.2.1 BiocManager_1.30.25
## [109] cli_3.6.3 rpart_4.1.23
## [111] xtable_1.8-4 munsell_0.5.1
## [113] jquerylib_0.1.4 Rsubread_2.21.1
## [115] Rcpp_1.0.13-1 png_0.1-8
## [117] XML_3.99-0.17 parallel_4.4.2
## [119] ggplot2_3.5.1 blob_1.2.4
## [121] bitops_1.0-9 scales_1.3.0
## [123] crayon_1.5.3 rlang_1.1.4
## [125] KEGGREST_1.47.0