RNA sequencing (RNA-seq) libraries may contain genomic DNA (gDNA) contamination due to an absent or inefficient gDNA digestion step (with DNase) during RNA extraction or library preparation. In fact, some protocols do not include a DNase treatment step, or they include it as optional.
While gDNA contamination is not a major issue in libraries built with poly(A) selected RNA molecules, it can remarkably affect gene expression quantification from libraries of total RNA. When present, gDNA contamination can lead to a misleading attribution of expression to unannotated regions of the genome. For this reason, it is important to check the levels of gDNA contamination during quality control before performing further analyses, specially when total RNA has been sequenced.
Here we illustrate the use of the gDNAx package for producing different diagnostics and how do they reveal different gDNA contamination levels. We use a subset of the data in (Li et al. 2022), which consists of 9 paired-end samples of total RNA-seq with increasing levels of gDNA contamination: 0% (no contamination), 1% and 10%, with 3 replicates each. The data is available through the Bioconductor experiment data package gDNAinRNAseqData, which allows one to download 9 BAM files, containing about 100,000 alignments, sampled uniformly at random from the complete BAM files.
library(gDNAinRNAseqData)
# Retrieve BAM files
bamfiles <- LiYu22subsetBAMfiles()
bamfiles
[1] "/tmp/Rtmp1vCECM/s32gDNA0.bam" "/tmp/Rtmp1vCECM/s33gDNA0.bam"
[3] "/tmp/Rtmp1vCECM/s34gDNA0.bam" "/tmp/Rtmp1vCECM/s26gDNA1.bam"
[5] "/tmp/Rtmp1vCECM/s27gDNA1.bam" "/tmp/Rtmp1vCECM/s28gDNA1.bam"
[7] "/tmp/Rtmp1vCECM/s23gDNA10.bam" "/tmp/Rtmp1vCECM/s24gDNA10.bam"
[9] "/tmp/Rtmp1vCECM/s25gDNA10.bam"
# Retrieve information on the gDNA concentrations of each BAM file
pdat <- LiYu22phenoData(bamfiles)
pdat
gDNA
s32gDNA0 0
s33gDNA0 0
s34gDNA0 0
s26gDNA1 1
s27gDNA1 1
s28gDNA1 1
s23gDNA10 10
s24gDNA10 10
s25gDNA10 10
Diagnosing the presence of gDNA contamination requires using an
annotation of genes and transcripts. The gDNAx package expects
that we provide such an annotation using a so-called TxDb
package, either as a TxDb
object, created once such a
package is loaded into the R session, or by specifying the name of the
package. The Bioconductor website
provides a number of TxDb
packages, but if the we do not
find the one we are looking for, we can build a TxDb
object
using the function makeTxDbFromGFF()
on a given GFF or
GTF
file, or any of the other makeTxDbFrom*()
functions,
available in the GenomicFeatures
package. Here we load the TxDb
package corresponding to the
GENCODE annotation provided by the UCSC Genome Browser.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# UCSC Track: GENCODE V46
# Resource URL: https://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 278220
# Db created by: txdbmaker package from Bioconductor
# Creation time: 2024-09-24 16:26:51 +0000 (Tue, 24 Sep 2024)
# txdbmaker version at creation time: 1.1.1
# RSQLite version at creation time: 2.3.7
# DBSCHEMAVERSION: 1.2
We can calculate diagnostics for gDNA contamination using the
function gDNAdx()
as follows.
library(gDNAx)
gdnax <- gDNAdx(bamfiles, txdb)
class(gdnax)
[1] "gDNAx"
attr(,"package")
[1] "gDNAx"
gdnax
gDNAx object
# BAM files (9): s32gDNA0.bam, ..., s25gDNA10.bam
# Library layout: paired-end, 9 (2x50nt)
# Library protocol: unstranded (9 out of 9)
# Sequences: only standard chromosomes
# Annotation pkg: TxDb.Hsapiens.UCSC.hg38.knownGene
# Alignments employed: first 100000
The previous call will show progress through its calculations unless
we set the argument verbose=FALSE
, and return an object of
class gDNAx
once it has finished. We have let the
gDNAdx()
function figure out the library layout and
protocol, but if we already knew those parameters from the data, we
could set them through the arguments singleEnd
and
strandMode
and speed up calculations. Another way to speed
up calculations, which may be advantageous specially when analysing a
high number of BAM files, is to use the BPPARAM
argument to
set a number of parallel threads of execution; see the help page of
gDNAdx()
for full details on how to specify non-default
values to all these parameters.
Calling the plot()
function with the resulting object
gDNAx
object as the first argument will plot several
diagnostics. Here below, we also use a parameter called
group
to automatically color samples, in this case, by the
gDNA contamination levels included in the experimental design of the
data; see (Li et al. 2022) for full
details on it.
The previous figure contains three diagnostic plots, each one showing the following values as a function of the percentage of read alignments fully contained in intergenic regions (IGC):
These data appear to come from an unstranded library, but if they would be stranded, a fourth diagnostic plot would appear showing an estimated value of the strandedness of each sample as function of the percentage of intergenic alignments. In stranded RNA-seq data, we should expect strandedness values close to 1, which imply that most reads align to the same strand than the annotated transcripts. Lower strandedness values can be indicative of gDNA contamination because reads sequenced from DNA are expected to align in equal proportions to both strands.
Because IGC alignments mainly originate from gDNA contamination, we may expect a negative correlation between the percentage of SCJ or SCE alignments and the percentage of IGC alignments. On the other hand, INT alignments may originate either from primary unprocessed transcripts in the nucleus, or from gDNA contamination as well. Therefore, we may also expect some positive correlation between the percentages of INT and IGC alignments, as it happens in this data.
Using the function getDx()
on the gDNAx
object, we obtain all the values used in the diagnostics.
dx <- getDx(gdnax)
dx
IGC INT SCJ SCE JNC IGCFLM SCJFLM SCEFLM
s32gDNA0 1.029500 11.76099 15.18463 40.02910 19.56452 174.206 162.057 160.640
s33gDNA0 1.118422 11.78807 15.21255 40.30533 19.55383 181.515 158.654 162.459
s34gDNA0 1.085408 12.26652 15.40036 40.18418 19.70788 170.984 156.824 153.233
s26gDNA1 1.394822 12.22450 14.78250 38.69502 18.73132 174.734 158.827 163.466
s27gDNA1 1.359395 12.44973 14.53507 38.20584 18.31765 173.362 164.782 163.763
s28gDNA1 1.491097 12.49686 14.09755 37.66753 17.96053 173.017 159.159 167.279
s23gDNA10 3.505813 13.17887 11.22446 30.99074 14.41110 197.262 161.201 164.646
s24gDNA10 3.756376 13.67709 10.84996 30.00253 13.91748 180.635 158.263 164.014
s25gDNA10 3.385574 13.53019 11.19650 30.63859 14.20691 168.728 159.881 161.126
INTFLM STRAND
s32gDNA0 156.952 NA
s33gDNA0 158.998 NA
s34gDNA0 152.887 NA
s26gDNA1 159.084 NA
s27gDNA1 159.457 NA
s28gDNA1 159.175 NA
s23gDNA10 159.352 NA
s24gDNA10 155.143 NA
s25gDNA10 157.692 NA
The column JNC
contains the percentage of alignments
that include one or more junctions, irrespective of whether those
aligments are compatible with an spliced transcript in the given
annotation. The columns with the suffix FLM
contain an
estimation of the fragment length mean in the alignments originating in
the corresponding region, and the column STRAND
stores the
strandedness values, which in this case are NA
because this
dataset is not strand-specific.
We can directly plot the estimated fragments length distributions
with the function plotFrgLength()
.
Another way to represent some of diagnostic measurements is to examine the origin of the alignments per sample in percentages. Fluctuations of these proportions across samples can help quantifying the amount of gDNA contamination per sample.
If we are interested in knowing exactly which annotations of
intergenic and intronic regions have been used to compute these
diagnostics, we can easily retrieve them using the functions
getIgc()
and getInt()
on the output
gDNAx
object, respectively.
igcann <- getIgc(gdnax)
igcann
GRanges object with 910526 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-10000 *
[2] chr1 14410-14695 *
[3] chr1 24887-26355 *
[4] chr1 26413-26582 *
[5] chr1 27138-27268 *
... ... ... ...
[910522] chrM 2746-3229 *
[910523] chrM 3308-4328 *
[910524] chrM 4401-7447 *
[910525] chrM 7515-16204 *
[910526] chrM 16250-16569 *
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
intann <- getInt(gdnax)
intann
GRanges object with 1493398 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 12228-12612 *
[2] chr1 12722-12974 *
[3] chr1 13053-13220 *
[4] chr1 14830-14969 *
[5] chr1 15039-15264 *
... ... ... ...
[1493394] chrY 57207675-57208210 *
[1493395] chrY 57208313-57208518 *
[1493396] chrY 57213358-57213525 *
[1493397] chrY 57213603-57213879 *
[1493398] chrY 57213965-57214349 *
-------
seqinfo: 25 sequences (1 circular) from hg38 genome
Since we have let the gDNAdx()
function to estimate
strandedness, we can examine those estimated values using the getter
function strandedness()
on the gDNAx
object.
strandedness(gdnax)
strandMode1 strandMode2 ambig Nalignments
s32gDNA0 0.4804169 0.4783929 0.04119024 66205
s33gDNA0 0.4801948 0.4780386 0.04176656 66321
s34gDNA0 0.4838895 0.4747806 0.04132993 66199
s26gDNA1 0.4845803 0.4756344 0.03978530 63717
s27gDNA1 0.4862653 0.4727614 0.04097329 62797
s28gDNA1 0.4800895 0.4789145 0.04099601 62128
s23gDNA10 0.4816028 0.4763408 0.04205635 50361
s24gDNA10 0.4778240 0.4817199 0.04045603 48769
s25gDNA10 0.4795262 0.4782635 0.04221033 49893
Using the function classifyStrandMode()
we can obtain a
classification of the most likely strand mode for each BAM file, given
some default cutoff values.
classifyStrandMode(strandedness(gdnax))
s32gDNA0 s33gDNA0 s34gDNA0 s26gDNA1 s27gDNA1 s28gDNA1 s23gDNA10 s24gDNA10
NA NA NA NA NA NA NA NA
s25gDNA10
NA
Li et al. (2022) report in their
publication that “sequencing libraries were generated using a TruSeq
Stranded Total RNA Library Prep Kit”. However, we can see that the
proportion of alignments overlapping transcripts in the column
strandMode1
is very similar to the one in the column
strandMode2
, which is compatible with an unstranded library
and the reason why we obtain NA
values in the output of
classifyStrandMode()
. We reach the same conclusion if we
use the RSeQC tool infer_experiment.py
(Wang, Wang, and Li 2012) and by visual
inspection of the alignment data in the Integrative Genomics Viewer
(IGV) (Robinson et al. 2011).
Following the recommendations made by Signal
and Kahlke (2022), gDNAx
attempts to use at least
200,000 alignments overlapping exonic regions to estimate strandedness.
In the subset of data used in this vignette, the number of alignments
used for that estimation is close to 60,000, which is the total number
of exonic alignments present in the BAM files.
If we are only interested in the estimation of strandedness values,
we can can also directly call strandedness()
with a
character string vector of BAM filenames and a TxDb
annotation object; see the help page of strandedness()
.
We can attempt removing read alignments from putative gDNA origin
using the function gDNAtx()
, which should be called with
the gDNAx
object returned by gDNAdx()
and a
path in the filesystem where to stored the filtered BAM files. By
default, these filtered BAM files include splice-compatible read
alignments (SCJ and SCE) that are found in a genomic window enriched for
stranded alignments. For further fine tuning of this filtering strategy
please use the function filterBAMtx()
.
## fbf <- filterBAMtxFlag(isSpliceCompatibleJunction=TRUE,
## isSpliceCompatibleExonic=TRUE)
## fstats <- filterBAMtx(gdnax, path=tmpdir, txflag=fbf)
## fstats
tmpdir <- tempdir()
fstats <- gDNAtx(gdnax, path=tmpdir)
fstats
NALN NIGC NINT NSCJ NSCE NSTW NNCH
s32gDNA0 99660 NA NA 15134 39905 46336 340
s33gDNA0 99694 NA NA 15170 40189 46823 306
s34gDNA0 99686 NA NA 15352 40068 46529 314
s26gDNA1 99726 NA NA 14743 38586 45061 274
s27gDNA1 99456 NA NA 14466 38011 45083 544
s28gDNA1 99457 NA NA 14023 37470 43892 543
s23gDNA10 99007 NA NA 11115 30690 36424 993
s24gDNA10 99005 NA NA 10746 29716 35032 995
s25gDNA10 99156 NA NA 11103 30394 36039 844
The first column NALN
corresponds to the total number of
read alignments processed. Columns NIGC
to
NSCE
contain the number of selected alignments from each
corresponding origin, where NA
indicates that that type of
alignment was not selected for filtering. The column NSTW
corresponds to selected alignments occurring in stranded windows, and
therefore this number will be always equal or smaller than the number of
the previous columns. The column NNCH
corresponds to
discarded read alignments ocurring in non-standard chromosomes.
sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] gDNAx_1.5.0
[2] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
[3] GenomicFeatures_1.59.1
[4] AnnotationDbi_1.69.0
[5] Biobase_2.67.0
[6] Rsamtools_2.23.0
[7] Biostrings_2.75.1
[8] XVector_0.47.0
[9] GenomicRanges_1.59.1
[10] GenomeInfoDb_1.43.1
[11] IRanges_2.41.1
[12] S4Vectors_0.45.2
[13] BiocGenerics_0.53.3
[14] generics_0.1.3
[15] gDNAinRNAseqData_1.6.0
[16] knitr_1.49
[17] BiocStyle_2.35.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 dplyr_1.1.4
[3] blob_1.2.4 filelock_1.0.3
[5] bitops_1.0-9 fastmap_1.2.0
[7] RCurl_1.98-1.16 BiocFileCache_2.15.0
[9] VariantAnnotation_1.53.0 GenomicAlignments_1.43.0
[11] XML_3.99-0.17 digest_0.6.37
[13] mime_0.12 lifecycle_1.0.4
[15] KEGGREST_1.47.0 RSQLite_2.3.8
[17] magrittr_2.0.3 compiler_4.4.2
[19] rlang_1.1.4 sass_0.4.9
[21] tools_4.4.2 plotrix_3.8-4
[23] utf8_1.2.4 yaml_2.3.10
[25] rtracklayer_1.67.0 S4Arrays_1.7.1
[27] bit_4.5.0 curl_6.0.1
[29] DelayedArray_0.33.2 RColorBrewer_1.1-3
[31] abind_1.4-8 BiocParallel_1.41.0
[33] withr_3.0.2 purrr_1.0.2
[35] sys_3.4.3 grid_4.4.2
[37] fansi_1.0.6 ExperimentHub_2.15.0
[39] SummarizedExperiment_1.37.0 cli_3.6.3
[41] rmarkdown_2.29 crayon_1.5.3
[43] httr_1.4.7 rjson_0.2.23
[45] DBI_1.2.3 cachem_1.1.0
[47] zlibbioc_1.52.0 parallel_4.4.2
[49] BiocManager_1.30.25 restfulr_0.0.15
[51] matrixStats_1.4.1 vctrs_0.6.5
[53] Matrix_1.7-1 jsonlite_1.8.9
[55] bit64_4.5.2 GenomicFiles_1.43.0
[57] maketools_1.3.1 jquerylib_0.1.4
[59] glue_1.8.0 codetools_0.2-20
[61] BiocVersion_3.21.1 BiocIO_1.17.0
[63] UCSC.utils_1.3.0 tibble_3.2.1
[65] pillar_1.9.0 rappdirs_0.3.3
[67] htmltools_0.5.8.1 BSgenome_1.75.0
[69] GenomeInfoDbData_1.2.13 R6_2.5.1
[71] dbplyr_2.5.0 lattice_0.22-6
[73] evaluate_1.0.1 AnnotationHub_3.15.0
[75] png_0.1-8 memoise_2.0.1
[77] bslib_0.8.0 SparseArray_1.7.2
[79] xfun_0.49 MatrixGenerics_1.19.0
[81] buildtools_1.0.0 pkgconfig_2.0.3