crisprBwa: alignment of gRNA spacer sequences using BWA

Overview of crisprBwa

crisprBwa provides two main functions to align short DNA sequences to a reference genome using the short read aligner BWA-backtrack (Li and Durbin 2009) and return the alignments as R objects: runBwa and runCrisprBwa. It utilizes the Bioconductor package Rbwa to access the BWA program in a platform-independent manner. This means that users do not need to install BWA prior to using crisprBwa.

The latter function (runCrisprBwa) is specifically designed to map and annotate CRISPR guide RNA (gRNA) spacer sequences using CRISPR nuclease objects and CRISPR genomic arithmetics defined in the Bioconductor package crisprBase. This enables a fast and accurate on-target and off-target search of gRNA spacer sequences for virtually any type of CRISPR nucleases. It also provides an off-target search engine for our main gRNA design package crisprDesign of the crisprVerse ecosystem. See the addSpacerAlignments function in crisprDesign for more details.

Installation and getting started

Software requirements

OS Requirements

This package is supported for macOS and Linux only. Package was developed and tested on R version 4.2.1.

Installation from Bioconductor

crisprBwa can be installed from from the Bioconductor devel branch using the following commands in a fresh R session:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(version="devel")
BiocManager::install("crisprBwa")

Building a bwa index

To use runBwa or runCrisprBwa, users need to first build a BWA genome index. For a given genome, this step has to be done only once. The Rbwa package conveniently provides the function bwa_build_index to build a BWA index from any custom genome from a FASTA file.

As an example, we build a BWA index for a small portion of the human chromosome 12 (chr12.fa file provided in the crisprBwa package) and save the index file as myIndex to a temporary directory:

library(Rbwa)
fasta <- system.file(package="crisprBwa", "example/chr12.fa")
outdir <- tempdir()
index <- file.path(outdir, "chr12")
Rbwa::bwa_build_index(fasta,
                      index_prefix=index)

To learn how to create a BWA index for a complete genome or transcriptome, please visit our tutorial page.

Alignment using runCrisprBwa

As an example, we align 5 spacer sequences (of length 20bp) to the custom genome built above, allowing a maximum of 3 mismatches between the spacer and protospacer sequences.

We specify that the search is for the wildtype Cas9 (SpCas9) nuclease by providing the CrisprNuclease object SpCas9 available through the crisprBase package. The argument canonical=FALSE specifies that non-canonical PAM sequences are also considered (NAG and NGA for SpCas9). The function getAvailableCrisprNucleases in crisprBase returns a character vector of available crisprNuclease objects found in crisprBase.

We also need to provide a BSgenome object corresponding to the reference genome used for alignment to extract protospacer and PAM sequences of the target sequences.

library(crisprBwa)
library(BSgenome.Hsapiens.UCSC.hg38)
## Loading required package: GenomeInfoDb
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: BSgenome
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: BiocIO
## Loading required package: rtracklayer
## 
## Attaching package: 'rtracklayer'
## The following object is masked from 'package:BiocIO':
## 
##     FileForFormat
data(SpCas9, package="crisprBase")
crisprNuclease <- SpCas9
bsgenome <- BSgenome.Hsapiens.UCSC.hg38
spacers <- c("AGCTGTCCGTGGGGGTCCGC",
             "CCCCTGCTGCTGTGCCAGGC",
             "ACGAACTGTAAAAGGCTTGG",
             "ACGAACTGTAACAGGCTTGG",
             "AAGGCCCTCAGAGTAATTAC")
runCrisprBwa(spacers,
             bsgenome=bsgenome,
             crisprNuclease=crisprNuclease,
             n_mismatches=3,
             canonical=FALSE,
             bwa_index=index)
## [runCrisprBwa] Using BSgenome.Hsapiens.UCSC.hg38 
## [runCrisprBwa] Searching for SpCas9 protospacers
##                 spacer          protospacer pam   chr pam_site strand
## 1 AAGGCCCTCAGAGTAATTAC AAGGCCCTCAGAGTAATTAC AGA chr12   170636      +
## 2 ACGAACTGTAAAAGGCTTGG ACGAACTGTAAAAGGCTTGG AGG chr12   170815      -
## 3 ACGAACTGTAACAGGCTTGG ACGAACTGTAAAAGGCTTGG AGG chr12   170815      -
## 4 AGCTGTCCGTGGGGGTCCGC AGCTGTCCGTGGGGGTCCGC AGG chr12   170585      +
## 5 CCCCTGCTGCTGTGCCAGGC CCCCTGCTGCTGTGCCAGGC CGG chr12   170609      +
##   n_mismatches canonical
## 1            0     FALSE
## 2            0      TRUE
## 3            1      TRUE
## 4            0      TRUE
## 5            0      TRUE

Applications beyond CRISPR

The function runBwa is similar to runCrisprBwa, but does not impose constraints on PAM sequences. It can be used to search for any short read sequence in a genome.

Example using RNAi (siRNA design)

Seed-related off-targets caused by mismatch tolerance outside of the seed region is a well-studied and characterized problem observed in RNA interference (RNAi) experiments. runBWa can be used to map shRNA/siRNA seed sequences to reference genomes to predict putative off-targets:

seeds <- c("GTAAGCGGAGTGT", "AACGGGGAGATTG")
runBwa(seeds,
       n_mismatches=2,
       bwa_index=index)
##           query   chr    pos strand n_mismatches
## 1 AACGGGGAGATTG chr12  68337      -            2
## 2 AACGGGGAGATTG chr12   1666      -            2
## 3 AACGGGGAGATTG chr12 123863      +            2
## 4 AACGGGGAGATTG chr12 151731      -            2
## 5 AACGGGGAGATTG chr12 110901      +            2
## 6 GTAAGCGGAGTGT chr12 101550      -            2

Reproducibility

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.75.0                  
##  [3] rtracklayer_1.67.0                BiocIO_1.17.1                    
##  [5] Biostrings_2.75.3                 XVector_0.47.0                   
##  [7] GenomicRanges_1.59.1              GenomeInfoDb_1.43.2              
##  [9] IRanges_2.41.2                    S4Vectors_0.45.2                 
## [11] BiocGenerics_0.53.3               generics_0.1.3                   
## [13] crisprBwa_1.11.0                  Rbwa_1.11.0                      
## [15] BiocStyle_2.35.0                 
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.37.0 rjson_0.2.23               
##  [3] xfun_0.49                   bslib_0.8.0                
##  [5] Biobase_2.67.0              lattice_0.22-6             
##  [7] tzdb_0.4.0                  vctrs_0.6.5                
##  [9] tools_4.4.2                 bitops_1.0-9               
## [11] curl_6.0.1                  parallel_4.4.2             
## [13] tibble_3.2.1                pkgconfig_2.0.3            
## [15] Matrix_1.7-1                lifecycle_1.0.4            
## [17] GenomeInfoDbData_1.2.13     compiler_4.4.2             
## [19] stringr_1.5.1               Rsamtools_2.23.1           
## [21] codetools_0.2-20            htmltools_0.5.8.1          
## [23] sys_3.4.3                   buildtools_1.0.0           
## [25] sass_0.4.9                  RCurl_1.98-1.16            
## [27] yaml_2.3.10                 pillar_1.10.0              
## [29] crayon_1.5.3                jquerylib_0.1.4            
## [31] BiocParallel_1.41.0         cachem_1.1.0               
## [33] DelayedArray_0.33.3         abind_1.4-8                
## [35] tidyselect_1.2.1            digest_0.6.37              
## [37] stringi_1.8.4               restfulr_0.0.15            
## [39] maketools_1.3.1             fastmap_1.2.0              
## [41] grid_4.4.2                  cli_3.6.3                  
## [43] SparseArray_1.7.2           magrittr_2.0.3             
## [45] S4Arrays_1.7.1              XML_3.99-0.17              
## [47] readr_2.1.5                 UCSC.utils_1.3.0           
## [49] bit64_4.5.2                 rmarkdown_2.29             
## [51] httr_1.4.7                  matrixStats_1.4.1          
## [53] bit_4.5.0.1                 hms_1.1.3                  
## [55] evaluate_1.0.1              knitr_1.49                 
## [57] rlang_1.1.4                 glue_1.8.0                 
## [59] crisprBase_1.11.0           BiocManager_1.30.25        
## [61] vroom_1.6.5                 jsonlite_1.8.9             
## [63] R6_2.5.1                    MatrixGenerics_1.19.0      
## [65] GenomicAlignments_1.43.0    zlibbioc_1.52.0

References

Li, Heng, and Richard Durbin. 2009. “Fast and Accurate Short Read Alignment with Burrows–Wheeler Transform.” Bioinformatics 25 (14): 1754–60.