Showing small differences between two long strings, such as DNA or AA
sequences is challenging, especially in R. Typically, DNA or AA sequence
alignments show all characters in a sequence. The package ggmsa does
this really well and is compatible with ggplot2. However, this is not
viable for sequences over a certain length.
Alternatively, top level visualizations may, for example, represent
degree of variation over the length in a line plot, making it possible
to gauge how strongly sequences differ, but not the quality of the
difference. The intention with this package is to provide a way to
visualize sequence alignments over the whole length of arbitrarily long
sequences without losing the ability to show small differences, see
figure @ref(fig:showcase).
Until the next major version of Bioconductor (expected October 2024),
ggseqalign
can be installed from the Devel
version of Bioconductor.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install(version = "devel")
BiocManager::valid() # checks for out of date packages
BiocManager::install("ggseqalign")
See the BiocManager vignette for instructions on using multiple versions of Bioconductor.
ggseqalign
can also be installed from it’s original
source on GitHub (requires devtools
)
This package relies on two core functions,
alignment_table()
and
plot_sequence_alignment()
. At its core, the former uses
PairwiseAlignment()
, previously in Biostrings,
now in pwalign,
to align one or several query strings to a subject string to parse all
information on mismatches, insertions and deletions into a table that is
used as the input for plotting with
plot_sequence_alignment()
.
A minimal example:
library(ggseqalign)
library(ggplot2)
query_strings <- (c("boo", "fibububuzz", "bozz", "baofuzz"))
subject_string <- "boofizz"
alignment <- alignment_table(query_strings, subject_string)
plot_sequence_alignment(alignment) +
theme(text = element_text(size = 15))
This package is fully compatible with DNAStringSet
and
AAStringSet
classes from Biostrings,
an efficient and powerful way to handle sequence data. The two examples
below use example data from the Biostrings
package and requires it to be installed. To install Biostrings,
enter
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
This chunk demonstrates reading sequence data from a FASTA file into
a DNAStringSet
-class object and aligning it to a manually
created DNAStringSet
-class object.
library(ggseqalign)
library(Biostrings)
library(ggplot2)
query_sequences <- Biostrings::readDNAStringSet(system.file("extdata",
"fastaEx.fa",
package = "Biostrings"
))
subject_sequence <- DNAStringSet(paste0("AAACGATCGATCGTAGTCGACTGATGT",
"AGTATATACGTCGTACGTAGCATCGTC",
"AGTTACTGCATGCCGG"))
alignment <- alignment_table(query_sequences, subject_sequence)
plot_sequence_alignment(alignment) +
theme(text = element_text(size = 15))
The plots that plot_sequence_alignment()
generates can
become hard to read if there are too many differences, see
fig. @ref(fig:noisefig). The package allows to hide character mismatches
to preserve legibility of structural differences
(fig. @ref(fig:noisefignolab)).
# load
dna <- Biostrings::readDNAStringSet(system.file("extdata",
"dm3_upstream2000.fa.gz",
package = "Biostrings"
))
q <- as(
c(substr(dna[[1]], 100, 300)),
"DNAStringSet"
)
s <- as(
c(substr(dna[[2]], 100, 300)),
"DNAStringSet"
)
names(q) <- c("noisy alignment")
names(s) <- "reference"
plot_sequence_alignment(alignment_table(q, s)) +
theme(text = element_text(size = 15))
plot_sequence_alignment(alignment_table(q, s), hide_mismatches = TRUE) +
theme(text = element_text(size = 15))
Since plot_sequence_alignment()
produces a ggplot-class
object, all aspects of the plots can be modified with ggplot2
functions, such as theme()
. As an example, let’s recreate
and modify figure @ref(fig:showcase).
library(ggseqalign)
library(ggplot2)
library(Biostrings)
dna <- readDNAStringSet(system.file("extdata", "dm3_upstream2000.fa.gz",
package = "Biostrings"
))
q <- dna[2:4]
s <- dna[5]
q[1] <- as(
replaceLetterAt(q[[1]], c(5, 200, 400), "AGC"),
"DNAStringSet"
)
q[2] <- as(
c(substr(q[[2]], 300, 1500), substr(q[[2]], 1800, 2000)),
"DNAStringSet"
)
q[3] <- as(
replaceAt(
q[[3]], 1500,
paste(rep("A", 1000), collapse = "")
),
"DNAStringSet"
)
names(q) <- c("mismatches", "deletions", "insertion")
names(s) <- "reference"
pl <- plot_sequence_alignment(alignment_table(q, s))
pl <- pl +
ylab("Sequence variants") +
xlab("Length in bp") +
scale_color_viridis_d() +
theme(
text = element_text(size = 20),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
axis.title = element_text()
)
pl
Some modifications may require digging into the plot object layers,
this can get finicky but is possible. We can use pl$layers
to get a summary of the object’s layers. In this case, the geom_point
layers that plot the dots for mismatches are entry 8 in the layer list,
the white bar that indicates deletions is usually in layer 2. You may
want to change the deletion bar’s color if you use another plot
background color. This code chunk modifies the pl
object
from the previous chunk; the above chunk has to be run prior to this
one.
# Define background color
bg <- "grey90"
# Change plot background
pl <- pl + theme(panel.background = element_rect(
fill = bg,
colour = bg
))
# Match deletion to background
pl$layers[[2]]$aes_params$colour <- bg
# Increase mismatch indicator size and change shape
pl$layers[[8]]$aes_params$size <- 2
pl$layers[[8]]$aes_params$shape <- 4
pl$layers[[8]]$aes_params$colour <- "black"
pl
Any additional parameters to alignment_table()
are
passed on to pwalign::pairwiseAlignment()
, check pwalign
for a comprehensive overview over the available options. As a simple
example, we may increase gap penalties for the alignment in
@ref(fig:minimal-example).
library(ggseqalign)
library(ggplot2)
query_strings <- (c("boo", "fibububuzz", "bozz", "baofuzz"))
subject_string <- "boofizz"
alignment <- alignment_table(query_strings, subject_string, gapOpening = 20)
plot_sequence_alignment(alignment) +
theme(text = element_text(size = 15))
The output in this vignette was produced under the following conditions:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 ggseqalign_1.1.0 Biostrings_2.75.0
## [4] GenomeInfoDb_1.41.2 XVector_0.45.0 IRanges_2.39.2
## [7] S4Vectors_0.43.2 BiocGenerics_0.53.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1
## [7] grid_4.4.1 fastmap_1.2.0 jsonlite_1.8.9
## [10] BiocManager_1.30.25 httr_1.4.7 fansi_1.0.6
## [13] viridisLite_0.4.2 UCSC.utils_1.1.0 scales_1.3.0
## [16] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
## [19] crayon_1.5.3 munsell_0.5.1 withr_3.0.2
## [22] cachem_1.1.0 yaml_2.3.10 tools_4.4.1
## [25] dplyr_1.1.4 colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [28] buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
## [31] lifecycle_1.0.4 zlibbioc_1.51.2 pwalign_1.1.0
## [34] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.8.0
## [37] gtable_0.3.6 glue_1.8.0 highr_0.11
## [40] xfun_0.48 tibble_3.2.1 tidyselect_1.2.1
## [43] sys_3.4.3 knitr_1.48 farver_2.1.2
## [46] htmltools_0.5.8.1 rmarkdown_2.28 maketools_1.3.1
## [49] labeling_0.4.3 compiler_4.4.1
The research and data generation that was a major motivation for me to finally create this package has received funding from the Norwegian Financial Mechanism 2014-2021, project DivGene: UMO-2019/34/H/NZ9/00559