ggmsa is a package designed to plot multiple sequence alignments.
This package implements functions for visualizing publication-quality multiple sequence alignments (protein/DNA/RNA) in R in a simple yet powerful way. It uses a modular design to annotate sequence alignments and accepts other data sets so that diagrams can be combined.
In this tutorial, we’ll work through the basics of using ggmsa.
We’ll start by importing some example data to use throughout this
tutorial. Besides FASTA files, several kinds of R objects can also be
used as input. available_msa() lists the MSA objects currently
supported.
library(ggmsa)
available_msa()
#> 1.files currently available:
#> .fasta
#> 2.XStringSet objects from 'Biostrings' package:
#> DNAStringSet RNAStringSet AAStringSet BStringSet DNAMultipleAlignment RNAMultipleAlignment AAMultipleAlignment
#> 3.bin objects:
#> DNAbin AAbin
protein_sequences <- system.file("extdata", "sample.fasta",
                                 package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa",
                               package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa",
                            package = "ggmsa")
The simplest way to use ggmsa:
ggmsa(protein_sequences, 300, 350, color = "Clustal",
      font = "DroidSansMono", char_width = 0.5, seq_name = TRUE)
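Since available_msa() lists XStringSet objects among the supported inputs, the same plot can presumably be drawn from an in-memory object instead of a file path. The following is a minimal sketch, assuming Biostrings is installed and that readAAStringSet() reads the bundled sample alignment as is; the variable names fasta_path and aa_set are introduced here for illustration only.

```r
library(ggmsa)
library(Biostrings)

# Locate the example protein alignment shipped with ggmsa.
fasta_path <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Read the alignment into an AAStringSet, one of the object types
# reported by available_msa().
aa_set <- readAAStringSet(fasta_path)

# Plot alignment columns 300 to 350 from the in-memory object,
# using the same settings as the file-based call above.
p <- ggmsa(aa_set, start = 300, end = 350, color = "Clustal",
           font = "DroidSansMono", char_width = 0.5, seq_name = TRUE)
p
```

Passing an object rather than a path can be convenient when the sequences have already been filtered or reordered in R before plotting.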
Several predefined color schemes for rendering MSAs are shipped in the
package. In the same way as for inputs, use available_colors() to list
the color schemes currently available. Note that the schemes for amino
acids (protein) and nucleotides (DNA/RNA) have different names.
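As a hedged illustration of choosing a scheme, the sketch below lists the schemes and then applies one to each sequence type. It assumes that available_colors() is the listing helper exported by ggmsa and that "Chemistry_NT" is the nucleotide counterpart of the "Chemistry_AA" scheme used elsewhere in this tutorial; check the output of available_colors() for the names valid in your installed version.

```r
library(ggmsa)

# Print the color schemes known to the installed ggmsa version.
available_colors()

protein_sequences <- system.file("extdata", "sample.fasta",
                                 package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa",
                            package = "ggmsa")

# Protein alignment with an amino-acid scheme.
p_aa <- ggmsa(protein_sequences, 300, 350, color = "Chemistry_AA",
              char_width = 0.5, seq_name = TRUE)

# Nucleotide alignment with a nucleotide scheme (name assumed, see above).
# font = NULL draws colored tiles without residue letters.
p_nt <- ggmsa(nt_sequences, color = "Chemistry_NT", font = NULL)
```

Printing p_aa or p_nt renders the corresponding alignment.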
ggmsa supports annotations for MSAs. As in ggplot2, annotations are
implemented as geoms, and users add them to a plot with +, like this:
ggmsa() + geom_*(). Automatically generated annotations containing
colored labels and symbols are overlaid on the MSA to indicate
potentially conserved or divergent regions.
For example, to visualize a multiple sequence alignment together with a sequence logo and a bar chart:
ggmsa(protein_sequences, 221, 280, seq_name = TRUE, char_width = 0.5) +
  geom_seqlogo(color = "Chemistry_AA") + geom_msaBar()
The following table lists the annotation layers supported by ggmsa:
Annotation modules | Type | Description |
---|---|---|
geom_seqlogo() | geometric layer | automatically generates a sequence logo for an MSA |
geom_GC() | annotation module | shows GC content as a bubble chart |
geom_seed() | annotation module | highlights the seed region on miRNA sequences |
geom_msaBar() | annotation module | shows sequence conservation as a bar chart |
geom_helix() | annotation module | depicts RNA secondary structure as arc diagrams (requires extra data) |
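The nucleotide-oriented modules in the table can be sketched in the same ggmsa() + geom_*() style. This is a hedged example rather than a reference: the "Chemistry_NT" scheme name, the seed string "GAGGUAG", and the star argument of geom_seed() are assumptions about the installed package version and the bundled miRNA sample, so adjust them to your own data.

```r
library(ggmsa)

nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa",
                            package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa",
                               package = "ggmsa")

# Nucleotide alignment with a sequence logo and a GC-content
# bubble chart attached as annotation modules.
p_gc <- ggmsa(nt_sequences, font = NULL, color = "Chemistry_NT") +
  geom_seqlogo(color = "Chemistry_NT") +
  geom_GC()

# miRNA alignment with the seed region marked.
# The seed sequence below is an assumed value for the sample file.
p_seed <- ggmsa(miRNA_sequences, char_width = 0.5,
                color = "Chemistry_NT") +
  geom_seed(seed = "GAGGUAG", star = TRUE)
```

Printing p_gc or p_seed draws the annotated alignment. geom_helix() is left out here because, as the table notes, it needs additional structure data.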
Check out the guides for learning everything there is to know about all the different features:
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] kableExtra_1.4.0 ggplot2_3.5.1 ggmsa_1.13.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 dplyr_1.1.4
#> [4] farver_2.1.2 Biostrings_2.75.0 fastmap_1.2.0
#> [7] lazyeval_0.2.2 ash_1.0-15 tweenr_2.0.3
#> [10] digest_0.6.37 R4RNA_1.33.0 lifecycle_1.0.4
#> [13] tidytree_0.4.6 magrittr_2.0.3 compiler_4.4.1
#> [16] rlang_1.1.4 sass_0.4.9 tools_4.4.1
#> [19] utf8_1.2.4 yaml_2.3.10 knitr_1.48
#> [22] labeling_0.4.3 xml2_1.3.6 RColorBrewer_1.1-3
#> [25] aplot_0.2.3 KernSmooth_2.23-24 withr_3.0.2
#> [28] purrr_1.0.2 BiocGenerics_0.53.0 sys_3.4.3
#> [31] grid_4.4.1 polyclip_1.10-7 proj4_1.0-14
#> [34] stats4_4.4.1 fansi_1.0.6 colorspace_2.1-1
#> [37] extrafontdb_1.0 scales_1.3.0 seqmagick_0.1.7
#> [40] MASS_7.3-61 cli_3.6.3 rmarkdown_2.28
#> [43] crayon_1.5.3 treeio_1.29.2 generics_0.1.3
#> [46] rstudioapi_0.17.1 ggtree_3.13.2 httr_1.4.7
#> [49] ape_5.8 cachem_1.1.0 ggforce_0.4.2
#> [52] stringr_1.5.1 zlibbioc_1.51.2 maps_3.4.2
#> [55] ggalt_0.4.0 parallel_4.4.1 ggplotify_0.1.2
#> [58] BiocManager_1.30.25 XVector_0.45.0 yulab.utils_0.1.7
#> [61] vctrs_0.6.5 jsonlite_1.8.9 gridGraphics_0.5-1
#> [64] IRanges_2.39.2 patchwork_1.3.0 S4Vectors_0.43.2
#> [67] systemfonts_1.1.0 maketools_1.3.1 jquerylib_0.1.4
#> [70] tidyr_1.3.1 glue_1.8.0 stringi_1.8.4
#> [73] gtable_0.3.6 GenomeInfoDb_1.41.2 UCSC.utils_1.1.0
#> [76] extrafont_0.19 munsell_0.5.1 tibble_3.2.1
#> [79] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
#> [82] R6_2.5.1 evaluate_1.0.1 lattice_0.22-6
#> [85] highr_0.11 ggfun_0.1.7 bslib_0.8.0
#> [88] Rcpp_1.0.13 svglite_2.1.3 nlme_3.1-166
#> [91] Rttf2pt1_1.3.12 xfun_0.48 fs_1.6.4
#> [94] buildtools_1.0.0 pkgconfig_2.0.3