This vignette belongs to the GenomicAlignments package. It illustrates how to use the package for working with the nucleotide content of aligned reads.
After the reads generated by a high-throughput sequencing experiment have been aligned to a reference genome, the questions that are being asked about these alignments typically fall in two broad categories: positional only and nucleotide-related.
Positional only questions are about the position of
the alignments with respect to the reference genome. Note that the
position of an alignment is actually better described in terms of
genomic ranges (1 range for an alignment with no gaps, 2 or more ranges
for an alignment with gaps). Knowing the ranges of the alignments is
sufficient to perform common tasks like read counting or for
computing the coverage. Read counting is the process
of counting the number of aligned reads per gene or exon and is
typically performed in the context of a differential analysis. This task
can be accomplished with the summarizeOverlaps
function
provided in the package and is explained in details in the “Counting
reads with summarizeOverlaps” vignette (also located in this package).
Computing the coverage is often the preliminary step to peak
detection (ChIP-seq analysis) or to a copy number analysis. It can be
accomplished with the coverage
function. See
?'coverage-methods'
for more information.
Nucleotide-related questions are about the nucleotide content of the alignments. In particular, how this content compares to the corresponding nucleotides in the reference genome. These questions typically arise in the context of small genetic variation detection between one or more samples and a reference genome.
The GenomicAlignments package provides a suite of low- to mid-level tools for dealing with nucleotide-related questions about the alignments. In this vignette, we illustrate their use on the single-end and paired-end reads of an RNA-seq experiment.
Note that these tools do NOT constitute a complete variant toolbox. If this is what you’re looking for, other Bioconductor packages might be more appropriate. See the GeneticVariability and SNP views at this URL https://bioconductor.org/packages/release/BiocViews.html#___AssayDomain for a complete list of packages that deal with small genetic variations. Most of them provide tools of higher level than the tools described in this vignette. See for example the VariantTools and VariantAnnotation packages for a complete variant toolbox (including variant calling capabilities).
In this section, we illustrate how aligned reads and their sequences can be loaded from a BAM file. The reads we’re going to use for this are paired-end reads from a published study by Zarnack et al., 2012. A subset of these reads are stored in the BAM files located in the data RNAseqData.HNRNPC.bam.chr14 data package. The package contains 8 BAM files, 1 per sequencing run:
library(RNAseqData.HNRNPC.bam.chr14)
bamfiles <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
names(bamfiles) # the names of the runs
## [1] "ERR127306" "ERR127307" "ERR127308" "ERR127309" "ERR127302" "ERR127303"
## [7] "ERR127304" "ERR127305"
Each BAM file was obtained by (1) aligning the reads (paired-end) to
the full hg19 genome with TopHat2, and then (2) subsetting to keep only
alignments on chr14. See ?RNAseqData.HNRNPS.bam.chr14
for
more information about this data set.
As a preliminary step, we check whether the BAM files contain single-
or paired-end alignments. This can be done with the
quickBamFlagSummary
utility from the Rsamtools
package:
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 800484 | 393300 | 2.04 / 10
## o template has single segment.... S | 0 | 0 | NA / NA
## o template has multiple segments. M | 800484 | 393300 | 2.04 / 10
## - first segment.............. F | 400242 | 393300 | 1.02 / 5
## - last segment............... L | 400242 | 393300 | 1.02 / 5
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
This confirms that all the alignments in the 1st BAM file (run
ERR127306) are paired-end. This means that we should preferably load
them with the readGAlignmentsPairs
function from the GenomicAlignments
package. However for the purpose of keeping things simple, we will
ignore the pairing for now and load only the alignments corresponding to
the first segment of the pairs. We will use the
readGAlignments
function from the GenomicAlignments
package for this, together with a ScanBamParam object for the
filtering. See ?ScanBamParam
in the Rsamtools
package for the details.
Furthermore, while preparing the ScanBamParam object to perform the filtering, we’ll also get rid of the PCR or optical duplicates (flag bit 0x400 in the SAM format, see the SAM Spec 1 for the details), as well as reads not passing quality controls (flag bit 0x200 in the SAM format).
Finally we also request the read sequences (a.k.a. the segment sequences in the SAM Spec, stored in the SEQ field) via the ScanBamParam object:
flag1 <- scanBamFlag(isFirstMateRead=TRUE, isSecondMateRead=FALSE, isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param1 <- ScanBamParam(flag=flag1, what="seq")
We’re now ready to load the alignments and their sequences. Note that
we use use.names=TRUE
in order to also load the query
names (a.k.a. the query template names in the SAM Spec,
stored in the QNAME field) from the BAM file.
readGAlignments
will use them to set the names of the
returned object:
This returns a GAlignments object. The read sequences are
stored in the seq
metadata column of the object:
## DNAStringSet object of length 400242:
## width seq
## [1] 72 TGAGAATGATGATTTCCAATTTCATCCATGT...GACATGAACTCATCATTTTTTATGGCTGCAT
## [2] 72 CCCCAGGTATACACTGGACTCCAGGTGGACA...GTTGGATACACACACTCAAGGTGGACACCAG
## [3] 72 CATAGATGCAAGAATCCTCAATCAAATACTA...TTCAACAGCACATTAAAAAGATAACTTACCA
## [4] 72 TGCTGGTGCAGGATTTATTCTACTAAGCAAT...ATCAAATCCACTTTCTTATCTCAGGAATCAG
## [5] 72 CAGGAGGTAGGCTGTGCGTTCAGCAGTTGGT...GGTACTGGTTGATCACCTTGACTGTCTGGTC
## ... ... ...
## [400238] 72 GGGAGGCCCTTTATATAACCATCAGGTCTTG...CTCACTAATAGGATAAAAGCATGGAGAGAAC
## [400239] 72 CCTGAGAGCCCCTTGCTGTCCTGAGCACCTC...GAGCGCCCTCTGGTGTTCTGATCACTCTCTG
## [400240] 72 CAACTTTTATTTCTTAAACACAAGACATTCC...CTGTTCTCAGGTGAGCTGTCGAGCAGGGAGG
## [400241] 72 CAAAGCTGGATGTGTCTAGTGTTTTTATCAG...CCGTAATAAGAGCATGTGTGGTTTTGCTGCC
## [400242] 72 CATGACTTGATGGCTGGAACAAATACATTTA...CTCCAATACTAGCCTTTGCCATACAGTATTT
Because the BAM format imposes that the read sequence is “reverse complemented” when a read aligns to the minus strand, we need to “reverse complement” it again to restore the original query sequences (i.e. the sequences before alignment, that is, as they can be seen in the FASTQ file assuming that the aligner didn’t perform any hard-clipping on them):
oqseq1 <- mcols(gal1)$seq
is_on_minus <- as.logical(strand(gal1) == "-")
oqseq1[is_on_minus] <- reverseComplement(oqseq1[is_on_minus])
Because the aligner used to align the reads can report more than 1
alignment per read (i.e. per sequence stored in the FASTQ file), we
shouldn’t expect the names of gal1
to be unique:
## is_dup
## FALSE TRUE
## 393300 6942
However, sequences with the same query name should correspond to the same original query and therefore should be the same. Let’s do a quick sanity check:
Finally, let’s oqseq1
reduce to one original query
sequence per unique query name (like in the FASTQ file
containing the 1st end of the unaligned reads):
Because the aligner possibly tolerated a small number of mismatches,
indels, and/or gaps during the alignment process, the sequences in
mcols(gal1)$seq
generally don’t match exactly the reference
genome.
The information of where indels and/or gaps occur in the alignments is represented in the CIGAR strings. Let’s have a look at these string. First the most frequent cigars:
##
## 72M 35M123N37M 64M316N8M 38M670N34M 18M123N54M 2M131N70M
## 301920 134 134 133 119 96
Then a summary of the total number of insertions (I), deletions (D), andcgaps (N):
## M I D N S H P =
## 28815928 1496 818 330945002 0 0 0 0
## X
## 0
This tells us that the aligner that was used supports indels (I/D) and junction reads (N).
Finally we count and summarize the number of gaps per alignment:
##
## 0 1 2 3
## 303622 94557 2052 11
Some reads contain up to 3 gaps (i.e. span 3 splice junctions).
TODO (with sequenceLayer
)
BAM file untreated3_chr4.bam (located in the pasillaBamSubset
data package) contains paired-end reads from the “Pasilla” experiment
and aligned against the dm3 genome (see ?untreated3_chr4
in
the pasillaBamSubset
package for more information about those reads). We use the
readGAlignmentPairs
function to load them into a
GAlignmentPairs object:
library(pasillaBamSubset)
flag0 <- scanBamFlag(isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param0 <- ScanBamParam(flag=flag0)
U3.galp <- readGAlignmentPairs(untreated3_chr4(), use.names=TRUE, param=param0)
head(U3.galp)
## GAlignmentPairs object with 6 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## SRR031715.1138209 chr4 + : 169-205 -- 326-362
## SRR031714.756385 chr4 + : 943-979 -- 1086-1122
## SRR031714.2355189 chr4 + : 944-980 -- 1119-1155
## SRR031714.5054563 chr4 + : 946-982 -- 986-1022
## SRR031715.1722593 chr4 + : 966-1002 -- 1108-1144
## SRR031715.2202469 chr4 + : 966-1002 -- 1114-1150
## -------
## seqinfo: 8 sequences from an unspecified genome
The show
method for GAlignmentPairs objects
displays two ranges columns, one for the first alignment in the
pair (the left column), and one for the last alignment in the
pair (the right column). The strand column corresponds to the strand of
the first alignment.
## GAlignments object with 6 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## SRR031715.1138209 chr4 + 37M 37 169 205
## SRR031714.756385 chr4 + 37M 37 943 979
## SRR031714.2355189 chr4 + 37M 37 944 980
## SRR031714.5054563 chr4 + 37M 37 946 982
## SRR031715.1722593 chr4 + 37M 37 966 1002
## SRR031715.2202469 chr4 + 37M 37 966 1002
## width njunc
## <integer> <integer>
## SRR031715.1138209 37 0
## SRR031714.756385 37 0
## SRR031714.2355189 37 0
## SRR031714.5054563 37 0
## SRR031715.1722593 37 0
## SRR031715.2202469 37 0
## -------
## seqinfo: 8 sequences from an unspecified genome
## GAlignments object with 6 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## SRR031715.1138209 chr4 - 37M 37 326 362
## SRR031714.756385 chr4 - 37M 37 1086 1122
## SRR031714.2355189 chr4 - 37M 37 1119 1155
## SRR031714.5054563 chr4 - 37M 37 986 1022
## SRR031715.1722593 chr4 - 37M 37 1108 1144
## SRR031715.2202469 chr4 - 37M 37 1114 1150
## width njunc
## <integer> <integer>
## SRR031715.1138209 37 0
## SRR031714.756385 37 0
## SRR031714.2355189 37 0
## SRR031714.5054563 37 0
## SRR031715.1722593 37 0
## SRR031715.2202469 37 0
## -------
## seqinfo: 8 sequences from an unspecified genome
According to the SAM format specifications, the aligner is expected to mark each alignment pair as proper or not (flag bit 0x2 in the SAM format). The SAM Spec only says that a pair is proper if the first and last alignments in the pair are “properly aligned according to the aligner”. So the exact criteria used for setting this flag is left to the aligner.
We use isProperPair
to extract this flag from the
object:
##
## FALSE TRUE
## 29581 45828
Even though we could do overlap encodings with the full object, we keep only the proper pairs for our downstream analysis:
Because the aligner used to align those reads can report more than 1
alignment per original query template (i.e. per pair of
sequences stored in the input files, typically 1 FASTQ file for the
first ends and 1 FASTQ file for the last ends), we
shouldn’t expect the names of U3.GALP
to be unique:
## U3.GALP_names_is_dup
## FALSE TRUE
## 43659 2169
Storing the query template names in a factor will be useful:
as well as having the mapping between each query template
name and its first occurence in U3.GALP_qnames
:
Our reads can have up to 1 gap per end:
## [1] "37M" "6M58N31M" "25M56N12M" "19M62N18M" "29M222N8M" "9M222N28M"
## [1] "37M" "19M58N18M" "12M58N25M" "27M2339N10M" "29M2339N8M"
## [6] "9M222N28M"
##
## 0 1
## 0 44510 596
## 1 637 85
Like for our single-end reads, the following tables indicate that indels were not allowed/supported during the alignment process:
## M I D N S H P = X
## 1695636 0 0 673919 0 0 0 0 0
## M I D N S H P = X
## 1695636 0 0 630395 0 0 0 0 0
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] pasillaBamSubset_0.44.0 GenomicAlignments_1.43.0
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [7] Rsamtools_2.23.1 Biostrings_2.75.1
## [9] XVector_0.47.0 GenomicRanges_1.59.1
## [11] GenomeInfoDb_1.43.2 IRanges_2.41.1
## [13] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [15] generics_0.1.3 RNAseqData.HNRNPC.bam.chr14_0.44.0
## [17] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.7.2 bitops_1.0-9
## [4] lattice_0.22-6 digest_0.6.37 evaluate_1.0.1
## [7] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
## [10] Matrix_1.7-1 BiocManager_1.30.25 httr_1.4.7
## [13] UCSC.utils_1.3.0 codetools_0.2-20 jquerylib_0.1.4
## [16] abind_1.4-8 cli_3.6.3 rlang_1.1.4
## [19] crayon_1.5.3 cachem_1.1.0 DelayedArray_0.33.2
## [22] yaml_2.3.10 S4Arrays_1.7.1 tools_4.4.2
## [25] parallel_4.4.2 BiocParallel_1.41.0 GenomeInfoDbData_1.2.13
## [28] buildtools_1.0.0 R6_2.5.1 lifecycle_1.0.4
## [31] zlibbioc_1.52.0 bslib_0.8.0 xfun_0.49
## [34] sys_3.4.3 knitr_1.49 htmltools_0.5.8.1
## [37] rmarkdown_2.29 maketools_1.3.1 compiler_4.4.2