SynExtend
is a package of tools for working with objects
of class Synteny
built from the package
DECIPHER
’s FindSynteny()
function.
Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share order. Typically these maps are built from predictions of orthologous pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are deemed a ‘syntenic block’. That designation of synteny can then used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements or ancestor genome reconstruction.
FindSynteny
takes a different approach, finding exactly
matched shared k-mers and determining where shared k-mers, or blocks of
proximate shared k-mers are significant. Combining the information
generated by FindSynteny
with locations of genomic features
allows us to simply mark where features are linked by syntenic k-mers.
These linked features represent potential orthologous pairs, and can be
easily evaluated on the basis of the k-mers that they share, or
alignment.
Currently SynExtend
contains one set of functions for
performig orthology predictions, as well as a rearrangement estimation
function that is currently under construction.
SynExtend
in R by running the following
commands:Using the FindSynteny
function in DECIPHER
builds an object of class Synteny
. In this tutorial, a
prebuilt DECIPHER
database is used. For database
construction see ?Seqs2DB
in DECIPHER
. This
example starts with a database containing three endosymbiont genomes
that were chosen to keep package data to a minimum.
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:generics':
##
## intersect, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
## setequal, table, tapply, union, unique, unsplit, which.max,
## which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
##
## dendrapply
DBPATH <- system.file("extdata",
"Endosymbionts_v02.sqlite",
package = "SynExtend")
Syn <- FindSynteny(dbFile = DBPATH)
## ================================================================================
##
## Time difference of 8.59 secs
Synteny maps represent where genomes share order. Simply printing a
synteny object to the console displays a gross level view of the data
inside. Objects of class Synteny
can also be plotted to
provide clear visual representations of the data inside. The genomes
used in this example are distantly related and fairly dissimilar.
## 1 2 3 4
## 1 1 seq 99% hits 98% hits 89% hits
## 2 65 blocks 1 seq 99% hits 88% hits
## 3 62 blocks 38 blocks 1 seq 89% hits
## 4 137 blocks 132 blocks 143 blocks 1 seq
Data present inside objects of class Synteny
can also be
accessed relatively easily. The object itself is functionally a matrix
of lists, with data describing exactly matched k-mers present in the
upper triangle, and data describing blocks of chained k-mers in the
lower triangle. For more information see ?FindSynteny
in
the package DECIPHER
.
## index1 index2 strand width start1 start2 frame1 frame2
## [1,] 1 1 1 12894 703704 460240 3 2
## [2,] 1 1 1 7693 716599 447346 2 3
## [3,] 1 1 1 94 724295 439652 0 0
## [4,] 1 1 1 1201 724390 439556 0 0
## [5,] 1 1 1 5910 725605 438338 1 1
## [6,] 1 1 1 40 731657 432285 0 0
## index1 index2 strand score start1 start2 end1 end2 first_hit last_hit
## [1,] 1 1 1 40531 703704 410323 753607 460240 1 22
## [2,] 1 1 1 27986 147272 148815 224367 225900 23 54
## [3,] 1 1 1 25605 570380 556462 607479 593557 55 67
## [4,] 1 1 1 23545 43617 309162 64086 329621 68 77
## [5,] 1 1 1 22183 373097 759900 404044 790839 78 95
## [6,] 1 1 1 21198 675314 463304 700637 488627 96 105
The above printed objects show the data for the comparison between the first and second genome in our database.
To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.
Next, GFF annotations for the associated genomes are parsed to
provide gene calls in a use-able format. GFFs are not the only possible
source of appropriate gene calls, but they are the source that was used
during package construction and testing. Parsed GFFs can be constructed
with gffToDataFrame
, for full functionality, or GFFs can be
imported via rtracklater::import()
for limited
functionality. GeneCalls for both the PairSummaries
and
NucleotideOverlap
functions must be named list, and those
names must match dimnames(Syn)[[1]]
.
# generating genecalls with local data:
GC <- gffToDataFrame(GFF = system.file("extdata",
"GCF_021065005.1_ASM2106500v1_genomic.gff.gz",
package = "SynExtend"),
Verbose = TRUE)
## ================================================================================
## Time difference of 3.837548 secs
# in an effort to be space conscious, not all original gffs are kept within this package
GeneCalls <- get(data("Endosymbionts_GeneCalls", package = "SynExtend"))
SynExtend
’s gffToDataFrame
function will
directly import gff files into a usable format, and includes other
extracted information.
## DataFrame with 6 rows and 11 columns
## Index Strand Start Stop Type ID
## <integer> <integer> <integer> <integer> <character> <character>
## 1 1 0 170 493 gene gene-M9394_RS00005
## 2 1 0 574 1935 gene gene-M9394_RS00010
## 3 1 0 1995 2522 gene gene-M9394_RS00015
## 4 1 0 2581 3321 gene gene-M9394_RS00020
## 5 1 1 4383 5396 gene gene-M9394_RS00025
## 6 1 0 5682 6377 gene gene-M9394_RS00030
## Range Product Coding Translation_Table
## <IRangesList> <character> <logical> <character>
## 1 170-493 cell envelope integr.. TRUE 11
## 2 574-1935 Tol-Pal system beta .. TRUE 11
## 3 1995-2522 peptidoglycan-associ.. TRUE 11
## 4 2581-3321 tol-pal system prote.. TRUE 11
## 5 4383-5396 6-phosphogluconolact.. TRUE 11
## 6 5682-6377 2%2C3-diphosphoglyce.. TRUE 11
## Contig
## <character>
## 1 NZ_CP097751.1
## 2 NZ_CP097751.1
## 3 NZ_CP097751.1
## 4 NZ_CP097751.1
## 5 NZ_CP097751.1
## 6 NZ_CP097751.1
Raw GFF imports are also acceptable, but prevent alignments in amino
acid space with PairSummaries()
.
X01 <- rtracklayer::import(system.file("extdata",
"GCA_000875775.1_ASM87577v1_genomic.gff.gz",
package = "SynExtend"))
class(X01)
print(X01)
SynExtend
’s primary functions provide a way to identify
where pairs of genes are explicitly linked by syntenic hits, and then
summarize those links. The first step is just identifying those
links.
##
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.3694351 secs
The Links
object generated by NucleotideOverlap is a raw
representation of positions on the synteny map where shared k-mers link
genes between paired genomes. As such, it is analagous in shape to
objects of class Synteny
. This raw object is unlikely to be
useful to most users, but has been left exposed to ensure that this data
remains accessible should a user desire to have access to it.
## [1] "LinkedPairs"
## 1 2 3 4
## 1 1 Contig 709 Pairs 709 Pairs 717 Pairs
## 2 754 Kmers 1 Contig 708 Pairs 715 Pairs
## 3 754 Kmers 728 Kmers 1 Contig 711 Pairs
## 4 1345 Kmers 1479 Kmers 1343 Kmers 1 Contig
This raw data can be processed to provide a straightforward summary of predicted pairs.
##
## Preparing overhead data.
## Overhead complete.
## Collecting pairs.
## ================================================================================
## Time difference of 1.849932 secs
The object LinkedPairs1
is a data.frame where each row
is populated by information about a predicted orthologous pair. By
default PairSummaries
uses a simple model to determine
whether the k-mers that link a pair of genes are likely to provide an
erroneous link. When set to Model = "Global"
, is is simply
a prediction of whether the involved nucleotides are likely to describe
a pair of genomic features whose alignment would result in a PID that
falls within a random distribution. This model is effective if somewhat
permissive, but is significantly faster than performing many pairwise
alignments.
## p1 p2 ExactMatch Consensus TotalKmers MaxKmer p1FeatureLength
## 1 1_1_1 2_1_310 324 0.7727273 1 324 324
## 2 1_1_2 2_1_309 1362 1.0000000 1 1362 1362
## 3 1_1_3 2_1_308 528 1.0000000 1 528 528
## 4 1_1_4 2_1_307 741 1.0000000 1 741 741
## 5 1_1_5 2_1_306 1014 1.0000000 1 1014 1014
## 6 1_1_6 2_1_305 696 1.0000000 1 696 696
## p2FeatureLength Adjacent TetDist PIDType PredictedPID
## 1 594 1 0.043957492 AA 0.4548953
## 2 1362 2 0.002943341 AA 0.9999764
## 3 528 2 0.005387480 AA 0.9982457
## 4 741 2 0.007901019 AA 0.9995543
## 5 1014 2 0.005767509 AA 0.9999043
## 6 696 2 0.000000000 AA 0.9995661
PairSummaries includes arguments that allow for aligning all pairs
that are predicted, via PIDs = TRUE
, while
IgnoreDefaultStringSet = FALSE
indicates that alignments
should be performed in nucleotide or amino acid space as is appropriate
for the linked sequences. Setting
IgnoreDefaultStringSet = TRUE
will force all alignments
into nucleotide space.
As of SynExtend v 1.3.13, the functions ExtractBy
and
DisjointSet
have been added to provide users with direct
tools to work with PairSummaries
objects.
##
## Assigning initial root:
## ================================================================================
##
## Time difference of 0.03834152 secs
##
## Assigning final root:
## ================================================================================
##
## Time difference of 0.02311158 secs
##
## Assigning single linkage clusters.
## Assignments complete.
##
## Time difference of 0.06647491 secs
# extract the first 10 clusters
Sets <- ExtractBy(x = LinkedPairs1,
y = DBPATH,
z = SingleLinkageClusters[1:10],
Verbose = TRUE)
##
## Extracting Sequences:
## ================================================================================
##
## Arranging Sequences:
## ================================================================================
##
## Time difference of 0.1657987 secs
## [[1]]
## DNAStringSet object of length 6:
## width seq names
## [1] 324 TTGACAACAAAAAAAAATGATAA...TATTAAATTTTTCTCCTCAATAA 1_1_1
## [2] 411 GTGATCGTTTCTTTGTTGCTATA...AAAAGAAAAATCAAGATCAATAA 1_1_663
## [3] 135 TTGCCCAACACCCCTAACAAGAA...ATCAGGTTGCCCAACACCCCTAA 1_1_664
## [4] 594 TTGGCAGGACCAACGTTGACCAA...TATTAAATTTTTCTCCTCAATAA 2_1_310
## [5] 324 TTGACAACAAAAAAAAATGATAA...TATTAAATTTTTCTCCTCAATAA 3_1_272
## [6] 1059 ATGATATCAATCATTGTACATGT...TATTAAATTTTTCTCCTCAATAA 4_1_442
##
## [[2]]
## DNAStringSet object of length 4:
## width seq names
## [1] 1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 1_1_2
## [2] 1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 2_1_309
## [3] 1362 ATGTATTTAATAATTAGAAAAAC...GGTCCCCATTACATTTAGAATAA 3_1_273
## [4] 1296 ATGTTAATTATATTTTTATGGAA...GGTCTCCATTACATTTAGAATAA 4_1_443
##
## [[3]]
## DNAStringSet object of length 4:
## width seq names
## [1] 528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 1_1_3
## [2] 528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 2_1_308
## [3] 528 ATGAAACCAAATAATTTTTTTAA...GTGTTGTATTAATATATAAATAA 3_1_274
## [4] 528 ATGAAATCAAACAATTTTTTTAA...GCGTTGTATTAATATATAAATAA 4_1_444
##
## [[4]]
## DNAStringSet object of length 4:
## width seq names
## [1] 741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 1_1_4
## [2] 741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 2_1_307
## [3] 741 ATGATGATATATTTTAATATTAC...AACGTCTAATACACCTATCATAA 3_1_275
## [4] 738 ATGATATATTTTAATATTACAAA...AACGTCTAATACACCTATCATAA 4_1_445
##
## [[5]]
## DNAStringSet object of length 4:
## width seq names
## [1] 1014 ATGATACAAATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 1_1_5
## [2] 1014 ATGATACAGATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 2_1_306
## [3] 1014 ATGATACAAATTGTTTATGTAGT...TTATGGCGTTGAATTGCAAATGA 3_1_276
## [4] 1014 ATGATACAAATTATTTATGTAGC...TTATTACGTTGAATTGCAAATGA 4_1_446
##
## [[6]]
## DNAStringSet object of length 4:
## width seq names
## [1] 696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 1_1_6
## [2] 696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 2_1_305
## [3] 696 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 3_1_277
## [4] 699 ATGCATATAACTAAATTAGTTTT...TTCAACATTATTATTTAAAATAA 4_1_447
Session Info:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SynExtend_1.19.0 DECIPHER_3.3.0 Biostrings_2.75.0
## [4] GenomeInfoDb_1.43.0 XVector_0.46.0 IRanges_2.41.0
## [7] S4Vectors_0.44.0 BiocGenerics_0.53.1 generics_0.1.3
## [10] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bit_4.5.0 jsonlite_1.8.9 highr_0.11
## [4] compiler_4.4.1 BiocManager_1.30.25 crayon_1.5.3
## [7] blob_1.2.4 parallel_4.4.1 jquerylib_0.1.4
## [10] yaml_2.3.10 fastmap_1.2.0 R6_2.5.1
## [13] knitr_1.48 maketools_1.3.1 GenomeInfoDbData_1.2.13
## [16] DBI_1.2.3 bslib_0.8.0 rlang_1.1.4
## [19] cachem_1.1.0 xfun_0.48 sass_0.4.9
## [22] sys_3.4.3 bit64_4.5.2 memoise_2.0.1
## [25] RSQLite_2.3.7 cli_3.6.3 zlibbioc_1.52.0
## [28] digest_0.6.37 lifecycle_1.0.4 vctrs_0.6.5
## [31] evaluate_1.0.1 buildtools_1.0.0 rmarkdown_2.28
## [34] httr_1.4.7 pkgconfig_2.0.3 tools_4.4.1
## [37] htmltools_0.5.8.1 UCSC.utils_1.2.0