The Rbwa package
provides an R wrapper around the two popular
BWA aligners BWA-backtrack
(Li and Durbin 2009) and BWA-MEM
(Li 2013).
As mentioned in the BWA manual (see http://bio-bwa.sourceforge.net/bwa.shtml), BWA-backtrack is designed for short Illumina reads (up to 100bp), while BWA-MEM is more suitable for longer sequences (70bp to 1Mbp) and supports split alignment.
Rbwa
can be installed from Bioconductor using the
following commands in a fresh R session:
The two main alignment functions are:
bwa_aln
bwa_mem
The package also includes the following convenience functions:
bwa_build_index
bwa_aln
output into SAM output:
bwa_sam
xa2multi
bwa_build_index
Both bwa_aln
and bwa_mem
require first to
create a genome index from a FASTA file. This is done only once for a
given genome. This can be done using the function
bwa_build_index
.
First, we load Rbwa
:
In the following example code, we build a BWA index for a small
portion of human chr12 that we stored in a FASTA file located within the
Rbwa
package. We store the index files in a temporary
directory.
dir <- tempdir()
fasta <- system.file(package="Rbwa",
"fasta/chr12.fa")
index_prefix <- file.path(dir, "chr12")
bwa_build_index(fasta,
index_prefix=index_prefix)
list.files(dir)
## [1] "chr12.amb" "chr12.ann" "chr12.bwt" "chr12.pac" "chr12.sa"
bwa_aln
We now align read sequences stored in the toy example FASTQ file
fastq/sequences.fastq
, provided in the Rbwa
package, to our indexed genome:
fastq <- system.file(package="Rbwa",
"fastq/sequences.fastq")
bwa_aln(index_prefix=index_prefix,
fastq_files=fastq,
sai_files=file.path(dir, "output.sai"))
Any valid BWA arguments can be passed to the bwa_aln
function. To see the complete list of valid arguments, please visit the
BWA reference manual: http://bio-bwa.sourceforge.net/bwa.shtml.
For instance, we can specify the maximal edit distance between the
query sequence and the reference genome to be 3 using n
, as
well as the maximal edit distance in the seed sequence k
to
be 3, where we specify that the length of the seed sequence is 13 using
the argument l
:
bwa_aln(index_prefix=index_prefix,
fastq_files=fastq,
sai_files=file.path(dir, "output.sai"),
n=3, k=3, l=13)
The output of bwa_aln
is an intermediate
sai
file that should be converted into a sam
file using the bwa_sam
function as follows:
bwa_sam(index_prefix=index_prefix,
fastq_files=fastq,
sai_files=file.path(dir, "output.sai"),
sam_file=file.path(dir, "output.sam"))
Let’s read the first few lines of the SAM file:
## [1] "@SQ\tSN:chr12\tLN:171330"
## [2] "@PG\tID:bwa\tPN:bwa\tVN:0.7.17-r1198-dirty\tCL:/tmp/RtmpRt1Iia/Rinst95016"
## [3] "ACATCAGAAAGAGCGGCAG\t0\tchr12\t170881\t37\t19M\t*\t0\t0\tACATCAGAAAGAGCGGCAG\t~~~~~~~"
## [4] "CAACCCAGCCCCCCTCCAA\t0\tchr12\t170801\t37\t19M\t*\t0\t0\tCAACCCAGCCCCCCTCCAA\t~~~~~~~"
## [5] "CCTGTGATCCACGGAGGCT\t0\tchr12\t170765\t37\t19M\t*\t0\t0\tCCTGTGATCCACGGAGGCT\t~~~~~~~"
## [6] "GCACTGCGGTGAGTGCTGT\t0\tchr12\t170665\t37\t19M\t*\t0\t0\tGCACTGCGGTGAGTGCTGT\t~~~~~~~"
## [7] "GCCTTTTACAGTTCGTACT\t0\tchr12\t170820\t37\t19M\t*\t0\t0\tGCCTTTTACAGTTCGTACT\t~~~~~~~"
## [8] "GTCATGCCCCCTCAGCCAG\t0\tchr12\t170703\t37\t19M\t*\t0\t0\tGTCATGCCCCCTCAGCCAG\t~~~~~~~"
## [9] "TCGGCTCTCACCGTGTCCG\t0\tchr12\t170646\t37\t19M\t*\t0\t0\tTCGGCTCTCACCGTGTCCG\t~~~~~~~"
By default, each row of the SAM output corresponds to the best alignment hit for a given input query sequence. Other alignments (secondary alignments, or other loci in case of multiple alignments) are stored in the XA tag.
The function xa2multi
conveniently extracts the
alignments from the XA tags and represent them as additional rows in the
SAM format. This can be executed as follows:
xa2multi(file.path(dir, "output.sam"),
file.path(dir, "output.multi.sam"))
strtrim(readLines(file.path(dir, "output.multi.sam")), 65)
## [1] "@SQ\tSN:chr12\tLN:171330"
## [2] "@PG\tID:bwa\tPN:bwa\tVN:0.7.17-r1198-dirty\tCL:/tmp/RtmpRt1Iia/Rinst95016"
## [3] "ACATCAGAAAGAGCGGCAG\t0\tchr12\t170881\t37\t19M\t*\t0\t0\tACATCAGAAAGAGCGGCAG\t~~~~~~~"
## [4] "CAACCCAGCCCCCCTCCAA\t0\tchr12\t170801\t37\t19M\t*\t0\t0\tCAACCCAGCCCCCCTCCAA\t~~~~~~~"
## [5] "CCTGTGATCCACGGAGGCT\t0\tchr12\t170765\t37\t19M\t*\t0\t0\tCCTGTGATCCACGGAGGCT\t~~~~~~~"
## [6] "GCACTGCGGTGAGTGCTGT\t0\tchr12\t170665\t37\t19M\t*\t0\t0\tGCACTGCGGTGAGTGCTGT\t~~~~~~~"
## [7] "GCCTTTTACAGTTCGTACT\t0\tchr12\t170820\t37\t19M\t*\t0\t0\tGCCTTTTACAGTTCGTACT\t~~~~~~~"
## [8] "GTCATGCCCCCTCAGCCAG\t0\tchr12\t170703\t37\t19M\t*\t0\t0\tGTCATGCCCCCTCAGCCAG\t~~~~~~~"
## [9] "TCGGCTCTCACCGTGTCCG\t0\tchr12\t170646\t37\t19M\t*\t0\t0\tTCGGCTCTCACCGTGTCCG\t~~~~~~~"
bwa_mem
The bwa_mem
function works similar to the
bwa_aln
function, except that it does not produce
intermediate .sai
files; it outputs a SAM file
directly:
fastq <- system.file(package="Rbwa",
"fastq/sequences.fastq")
bwa_mem(index_prefix=index_prefix,
fastq_files=fastq,
sam_file=file.path(dir, "output.sam"))
## [1] "@SQ\tSN:chr12\tLN:171330"
## [2] "@PG\tID:bwa\tPN:bwa\tVN:0.7.17-r1198-dirty\tCL:/tmp/RtmpRt1Iia/Rinst95016"
## [3] "ACATCAGAAAGAGCGGCAG\t4\t*\t0\t0\t*\t*\t0\t0\tACATCAGAAAGAGCGGCAG\t~~~~~~~~~~~~~~~~~~~\t"
## [4] "CAACCCAGCCCCCCTCCAA\t4\t*\t0\t0\t*\t*\t0\t0\tCAACCCAGCCCCCCTCCAA\t~~~~~~~~~~~~~~~~~~~\t"
## [5] "CCTGTGATCCACGGAGGCT\t4\t*\t0\t0\t*\t*\t0\t0\tCCTGTGATCCACGGAGGCT\t~~~~~~~~~~~~~~~~~~~\t"
## [6] "GCACTGCGGTGAGTGCTGT\t4\t*\t0\t0\t*\t*\t0\t0\tGCACTGCGGTGAGTGCTGT\t~~~~~~~~~~~~~~~~~~~\t"
## [7] "GCCTTTTACAGTTCGTACT\t4\t*\t0\t0\t*\t*\t0\t0\tGCCTTTTACAGTTCGTACT\t~~~~~~~~~~~~~~~~~~~\t"
## [8] "GTCATGCCCCCTCAGCCAG\t4\t*\t0\t0\t*\t*\t0\t0\tGTCATGCCCCCTCAGCCAG\t~~~~~~~~~~~~~~~~~~~\t"
## [9] "TCGGCTCTCACCGTGTCCG\t4\t*\t0\t0\t*\t*\t0\t0\tTCGGCTCTCACCGTGTCCG\t~~~~~~~~~~~~~~~~~~~\t"
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rbwa_1.11.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
## [4] xfun_0.49 maketools_1.3.1 cachem_1.1.0
## [7] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
## [10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
## [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.2
## [16] sys_3.4.3 tools_4.4.2 evaluate_1.0.1
## [19] bslib_0.8.0 yaml_2.3.10 BiocManager_1.30.25
## [22] jsonlite_1.8.9 rlang_1.1.4