--- title: "CleanUpRNAseq: detecting and correcting gDNA contamination in RNA-seq data" author: - name: Haibo Liu affiliation: MCCB, UMass Chan Medical School, Worcester, USA - name: Kai Hu affiliation: MCCB, UMass Chan Medical School, Worcester, USA - name: Kevin O'Connor affiliation: MCCB, UMass Chan Medical School, Worcester, USA - name: Michelle Kelliher affiliation: MCCB, UMass Chan Medical School, Worcester, USA - name: Lihua Julie Zhu affiliation: MCCB, UMass Chan Medical School, Worcester, USA date: "`r Sys.Date()`" package: "CleanUpRNAseq" output: BiocStyle::html_document: toc_float: true link-citations: yes bibliography: bibliography.bib abstract: | Some RNA-seq data might suffer from genomic DNA (gDNA) contamination due to carrying over of residual gDNA in RNA preparation into sequencing library. The R package CleanUpRNAseq provides a set of functionalities to detect and correct for gDNA contamination, thus facilitate better quantitation of gene expression and differential expression analysis. vignette: | %\VignetteIndexEntry{CleanUpRNAseq: detecting and correcting for DNA contamination\nin RNA-seq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} options(timeout = Inf) knitr::opts_chunk$set( eval = TRUE, message = FALSE, warning = FALSE, tidy = FALSE, fig.align = 'center' ) ``` # Introduction RNA-seq has become a state-of-art technology for studying gene expression [@Wang2009]. However, due to improper RNA preparation and choice of some library preparation protocols, such as rRNA-depletion based RNA-seq protocol [@ONeil2013] and the [SMART-Seq](https://t.ly/sMZho) protocol, RNA-seq data might suffer from genomic DNA (gDNA) contamination, which skews quantitation of gene expression and hinders differential gene expression analysis [@Li2022;@Verwilt2020;@Markou2021]. Some quality control tools have been developed to check the quality of RNA-seq data at the raw read and post-alignment levels, including [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), RSeQC [@Wang2012], Qualimap [@GarcaAlcalde2012], RNA-SeQC/RNA-SeQC 2 [@DeLuca2012;@Graubert2021], QoRTs [@Hartley2015], RNA-QC-Chain [@Zhou2018], and [RNAseqQC](https://cran.r-project.org/web/packages/RNAseqQC/index.html). Those post-alignment tools can report percentages of reads mapping to different genomic features, such as genes, exons, introns, intergenic regions, and rRNA exons. Thus, they can be used to roughly detect gDNA contamination. So far, [SeqMonk](https://www.bioinformatics.babraham.ac.uk/projects/seqmonk/) and the gDNAx package are the only tool which can be used to both detect and correct for gDNA contamination in RNA-seq data. However, SeqMonk is a Java-based GUI tool which makes it not available in most high performance computing cluster. More importantly, seqMonk assumes a uniform distribution of reads derived from gDNA contamination and its performance on correcting for gDNA contamination in RNA-seq data is not fully evaluated and peer reviewed. On the other hand, gDNAx is an R/Bioconductor package and can be incorporated into automated RNA-seq data analysis pipeline easily. However, gDNAx only simply removes reads mapping to intergenic regions and introns, but not those mapping to exons, to mitigate gDNA contamination. Thus, gDNAx's algorithm for correcting for gDNA contamination is not sophisticated. To this end, we developed the R pacakge `r Biocpkg("CleanUpRNAseq")`, which provides a full set of functions for detecting and correcting for gDNA contamination in RNA-seq data across all computing platforms. # Setting up As for any other sequencing data analysis, users should first check the quality of raw RNA-seq sequencing data using a tool like [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) followed by MultiQC [@Ewels2016]. Depending on the quality of the raw data, trimming low quality bases and/or adpator sequences might be necessary to increase mapping quality and rate. Subsequently, reads are mapped to the reference genome of interest using tools, such as STAR [@Dobin2012], HISAT2 [@Kim2019], GMAP [@Wu2016], and Subread [@Liao2013]. The resulting alignment files in the BAM format are used for post-alignment quality control, including detection of gDNA contamination. Meanwhile, if the RNA-seq library is *unstranded*, reads are mapped to the reference transcriptome using Salmon [@Patro2017] to get transcript-level abundance, counts, and effective lengths, since Salmon can resolve reads mapped to multiple transcripts, which are imported into R using the `r Biocpkg("tximport")` to get gene-level counts, abundance and length information. However, if the library is *stranded*, reads are mapped to the reference transcriptome using Salmon twice: one using the designed strandedness for the '--libType' option, the other using the opposite strandedness. See Salmon library type discussion . Then gene-level counts, abundance and length information are imported into R as above. # How to run CleanUpRNAseq First, load the required packages, including CleanUpRNAseq. Then users can perform a step-by-step analysis of the RNA-seq data as below for more flexibility. For users' convenience, CleanUpRNAseq also offers two wrapper functions, create_diagnostic_plots and correct_for_contamination, for detecting and correcting for gDNA contamination in RNA-seq data. As for how to use these wrappers, please refer to their function documentation. ```{r load_package, eval = TRUE} suppressPackageStartupMessages({ library("CleanUpRNAseq") #devtools::load_all("../../CleanUpRNAseq") library("ggplotify") library("patchwork") library("ensembldb") library("utils") }) ``` ## Step 1: Load an EnsDb package or prepare an EnsDb database Users can list all current EnsDb packages from `r Biocpkg("AnnotationHub")` and load the package of choice, if available, as follows. Here, an EnsDb package for the human genome is loaded. It is crucial to use an EnsDb pacakge which represents the genome annotation file (GTF) used for RNA-seq read alignment. ```{r load_ensdb, eval=FALSE} suppressPackageStartupMessages({ library("EnsDb.Hsapiens.v86") }) hs_ensdb_sqlite <- EnsDb.Hsapiens.v86 ``` Otherwise, users can easily prepare an EnsDb database from an Ensembl GTF file. For all following steps, this option is adopted because the latest version of human transcriptome has been used for read mapping by STAR and Salmon to quantify gene expression of the example RNA-seq data. ```{r create_ensdb} options(timeout = max(3000, getOption("timeout"))) tmp_dir <- tempdir() gtf <- system.file("extdata", "example.gtf.gz", package = "CleanUpRNAseq") hs_ensdb_sqlite <- ensDbFromGtf( gtf = gtf, outfile = file.path(tmp_dir, "EnsDb.hs.v110.sqlite"), organism = "Homo_Sapiens", genomeVersion = "GRCh38", version = 110 ) ``` ## Step 2. Prepare SAF (simplified annotation format) files Potential DNA contamination exists if a significantly high portion of RNA-seq reads mapped to intergenic regions. `r Biocpkg("CleanUpRNAseq")` uses the *featureCounts* function from the `r Biocpkg("Rsubread")` package to quantify the percentage of reads mapped to different genomic features: genes, exons, introns, intergenic regions, rRNA exons, and organelle genome(s). So users need to prepare SAF files for these genomic features. In addition, a BAM file is needed to provide the lengths of the chromosomes/scaffolds. ```{r prepare_saf} bam_file <- system.file("extdata", "K084CD7PCD1N.srt.bam", package = "CleanUpRNAseq" ) saf_list <- get_saf( ensdb_sqlite = hs_ensdb_sqlite, bamfile = bam_file, mitochondrial_genome = "MT" ) ``` ## Step 3. Summarize reads mapped to different genomic features Reads mapped to different genomic features is summarized by using the *featureCounts* function with the SAF files generated above as annotation. `r Biocpkg("CleanUpRNAseq")` also quantifies the reads spanning exon-exon junctions and the reads mapped to exons of each gene using the GTF file as annotation. The junction read count table is used to determine the unexpressed, multiexonic genes, while the gene-level read count table is used for comparing samples at the gene level. Here two downsampled BAM files are used for speeding up the demonstration, while a precomputed R object in the .RData format is used for actual downstream analysis. ```{r load_data} tmp_dir <- tempdir() in_dir <- system.file("extdata", package = "CleanUpRNAseq") gtf.gz <- dir(in_dir, ".gtf.gz$", full.name = TRUE) gtf <- file.path(tmp_dir, gsub("\\.gz", "", basename(gtf.gz))) R.utils::gunzip(gtf.gz, destname= gtf, overwrite = TRUE, remove = FALSE) in_dir <- system.file("extdata", package = "CleanUpRNAseq") BAM_file <- dir(in_dir, ".bam$", full.name = TRUE) salmon_quant_file <- dir(in_dir, ".sf$", full.name = TRUE) sample_name = gsub(".+/(.+?).srt.bam", "\\1", BAM_file) salmon_quant_file_opposite_strand <- salmon_quant_file col_data <- data.frame(sample_name = sample_name, BAM_file = BAM_file, salmon_quant_file = salmon_quant_file, salmon_quant_file_opposite_strand = salmon_quant_file_opposite_strand, group = c("CD1N", "CD1P")) sc <- create_summarizedcounts(lib_strand = 0, colData = col_data) ## featurecounts capture.output({counts_list <- summarize_reads( SummarizedCounts = sc, saf_list = saf_list, gtf = gtf, threads = 1, verbose = TRUE )}, file = tempfile()) ## load salmon quant salmon_counts <- salmon_tximport( SummarizedCounts = sc, ensdb_sqlite = hs_ensdb_sqlite ) ``` ## Step 4. Check DNA contamination Precomputed R object in the .RData format (*feature_counts_list.rda*, and *salmon_quant.rda*) containing the *featureCounts* output and *Salmon quantification* output imported by using the `tximport` function for a RNA-seq dataset of 8 samples from two treatment groups are used for demo. GC-content for genes and intergenic regions (*intergenic_GC.rda*, and *gene_GC.rda*) were also precomputed and used for demo. ```{r plot_assignment_stat} #| mappingstat, fig.height = 6, fig.width = 6, #| fig.cap = "Read mapping status" data("feature_counts_list") data("salmon_quant") data("intergenic_GC") data("gene_GC") lib_strand <- 0 col_data_f <- system.file("extdata", "example.colData.txt", package = "CleanUpRNAseq") col_data <- read.delim(col_data_f, as.is = TRUE) ## create fake bam files tmp_dir <- tempdir() bamfiles <- gsub(".+/", "", col_data$BAM_file) null <- lapply(file.path(tmp_dir, bamfiles), file.create) ## create fake quant.sf files quant_sf <- file.path(tmp_dir, gsub(".srt.bam$", "quant.sf", bamfiles)) null <- lapply(quant_sf, file.create) col_data$BAM_file <- file.path(tmp_dir, bamfiles) col_data$salmon_quant_file <- quant_sf ## pretend this is stranded RA=NA-seq data col_data$salmon_quant_file_opposite_strand <- quant_sf sc <- create_summarizedcounts(lib_strand, col_data) sc$set_feature_counts(feature_counts_list) sc$set_salmon_quant(salmon_quant) sc$set_salmon_quant_opposite(salmon_quant) p_assignment_stat <-plot_assignment_stat(SummarizedCounts = sc) wrap_plots(p_assignment_stat) ``` ```{r} #| readdistr, fig.height = 4, fig.width = 5, #| fig.cap = "Read distribution over different genomic features" assigned_per_region <- get_region_stat(SummarizedCounts = sc) p <- plot_read_distr(assigned_per_region) p ``` ```{r plot_sample_corr} #| samplecorr, fig.height = 8, fig.width = 8, #| fig.cap = "Smoothed scatter plot showing pairwise sample correlation" plot_sample_corr(SummarizedCounts = sc) ``` ```{r plot_read_distr} #| exprdistr, fig.height = 8, fig.width = 5, #| fig.cap = "Expression distribution" p_expr_distr <- plot_expr_distr( SummarizedCounts = sc, normalization = "DESeq2" ) wrap_plots(p_expr_distr, ncol = 1, nrow = 3) ``` ```{r percent_expressed_gene} #| exprgene, fig.height = 3, fig.width = 3.5, out.width = "80%", #| fig.cap = "Percent of expressed genes" plot_gene_content( SummarizedCounts = sc, min_cpm = 1, min_tpm =1 ) ``` ```{r pca_heatmap} #| samplesimilarity, fig.height = 5, fig.width = 10, #| fig.cap = "PCA score plot and heatmap showing sample similarity" ## DESeq2 exploratory analysis before correction p<- plot_pca_heatmap(SummarizedCounts = sc, silent = TRUE) p$pca + as.ggplot(p$heatmap) ``` ## Step 5. Correct for DNA contamination If the libraries are unstranded, `r Biocpkg("CleanUpRNAseq")` uses a median per-base read coverage (median_PBRC) over non-zero count intergenic regions to estimate per-base DNA contamination over exons of each gene of each sample, and corrects for gene-level DNA contamination by subtracting median_PBRC * average_gene_length from the Salmon count table of each gene of each sample. ```{r global_correction} global_correction <- correct_global(SummarizedCounts = sc) ``` Alternatively, for unstranded RNA-seq data, `r Biocpkg("CleanUpRNAseq")` offers a correction method leveraging GC-content bias on PCR amplification. ```{r GC_correction} gc_correction <- correct_GC( SummarizedCounts = sc, gene_gc = gene_GC, intergenic_gc = intergenic_GC, plot = FALSE ) ``` However, if the libraries are stranded, `r Biocpkg("CleanUpRNAseq")` considered as gene-level contamination the count table resulted from quantitation using the opposite strandedness setting. By subtracting the DNA contamination of each gene in each sample from the count table resulted from quantitation using the actual strandedness setting, users can get contamination corrected count table. To this end, use the `correct_stranded` function. ## Step 6. Check correction effect After correcting for DNA contamination, we expect to see more comparable gene expression across samples. Boxplots, density plots and empirical cumulative distribution after correction revealed gene expression across samples are more comparable. # Session info Here is the output of `sessionInfo()` on the system on which this document was compiled running pandoc `r rmarkdown::pandoc_version()`: ```{r sessionInfo, eval=TRUE, echo = FALSE} sessionInfo() ```