SynExtend started out as a package of tools for working
with objects of class Synteny built from the package
DECIPHER’s FindSynteny() function.
Synteny maps provide a powerful tool for quantifying and visualizing where pairs of genomes share ordered information. Typically these maps are built from predictions of orthologous feature pairs, where groups of pairs that provide contiguous and sequential blocks in their respective genomes are referred to as a ‘syntenic block’. The identification of syntenic blocks can then be used to further interrogate the predicted orthologs themselves, or query topics like genomic rearrangements, ancestral genome reconstruction, or shared functional neighborhoods.
FindSynteny takes a different approach, finding exactly
matched shared k-mers and determining where shared k-mers, or blocks of
proximally shared k-mers are significant. Combining the information
generated by FindSynteny with locations of genomic features
allows us to simply mark where features are linked by syntenic k-mers.
These linked features represent potential orthologous pairs, and can be
easily evaluated on the basis of the k-mers that they share, or
alignment.
Currently SynExtend contains a set of functions for
inferring orthologous relationships between features in different
genomes, as well as tools for predicting feature interaction
networks.
SynExtend with BiocManager:Using the FindSynteny function in DECIPHER
builds an object of class Synteny. In this tutorial, a
prebuilt DECIPHER database is used. For database
construction see ?Seqs2DB in DECIPHER. This
example starts with a database containing three endosymbiont genomes
that were chosen to keep package data to a minimum.
## Loading required package: DECIPHER
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following object is masked from 'package:utils':
##
## data
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: Seqinfo
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'SynExtend'
## The following object is masked from 'package:stats':
##
## dendrapply
library(DBI)
file01 <- system.file("extdata",
"example_db.sqlite",
package = "SynExtend")
syn <- FindSynteny(dbFile = file01)## ================================================================================
##
## Time difference of 3.02 secs
Synteny maps represent where genomes share order. Simply printing a
synteny object to the console displays a gross level view of the data
inside. Objects of class Synteny can also be plotted to
provide clear visual representations of the data inside. The genomes
used in this example are distantly related and fairly dissimilar.
## 1 3 4
## 1 1 seq 18% hits 26% hits
## 3 50 blocks 1 seq 60% hits
## 4 66 blocks 54 blocks 1 seq
Data present inside objects of class Synteny can also be
accessed relatively easily. The object itself is functionally a matrix
of lists, with data describing exactly matched k-mers present in the
upper triangle, and data describing blocks of chained k-mers in the
lower triangle. For more information see ?FindSynteny in
the package DECIPHER.
The upper triangle of the synteny upbject is populated by the individual hits.
## index1 index2 strand width start1 start2 frame1 frame2
## [1,] 1 1 0 275 63033 219267 3 3
## [2,] 1 1 0 317 63810 219999 3 3
## [3,] 1 1 0 1031 64194 220364 3 2
## [4,] 1 1 0 210 65231 221395 3 2
## [5,] 1 1 0 1523 65481 221645 3 2
## [6,] 1 1 0 353 161688 105431 3 2
The lower triangle of the synteny object is populated by the block assignments of the hits.
## index1 index2 strand score start1 start2 end1 end2 first_hit last_hit
## [1,] 1 1 0 1914 63033 219267 67003 223167 1 5
## [2,] 1 1 0 1826 161688 105431 167990 111398 6 25
## [3,] 1 1 0 1751 707150 43365 710999 47149 26 32
## [4,] 1 1 0 1588 52605 1 55680 2978 33 40
## [5,] 1 1 0 1157 154168 97916 161487 104655 41 58
## [6,] 1 1 0 1116 314963 134777 318319 138024 59 66
To take advantage of these synteny maps, we can then overlay the gene calls for each genome present on top of our map.
Next, GFF annotations for the associated genomes are parsed to
provide gene calls in a use-able format. GFFs are not the only possible
source of appropriate gene calls, but they are the source that was used
during package construction and testing. GFFs are imported via
rtracklater::import, and converted to a DataFrame via
SquaregffBy. Gene calls for both the
SummarizePairs and NucleotideOverlap functions
must be a named list, and those names must all be present within
dimnames(syn)[[1]].
gr_obj <- rtracklayer::import(system.file("extdata",
"GCF_023585725.1_ASM2358572v1_genomic.gff.gz",
package = "SynExtend"))
genecalls_object <- SquaregffBy(gff_object = gr_obj,
verbose = TRUE)## parsing parent-child heirarchies...
## ================================================================================
## checking feature / contig bound conflicts...
## ================================================================================
## Time difference of 5.977178 secs
## DataFrame with 6 rows and 14 columns
## Index Strand Start Stop Type ID
## <integer> <integer> <integer> <integer> <character> <character>
## 1 1 1 2972 3853 gene gene-M9404_RS00010
## 2 1 0 4525 5064 gene gene-M9404_RS00015
## 3 1 0 5220 5295 gene gene-M9404_RS00020
## 4 1 0 5877 6365 gene gene-M9404_RS00025
## 5 1 0 6392 7630 gene gene-M9404_RS00030
## 6 1 0 7890 8804 gene gene-M9404_RS00035
## Range Product Coding Translation_Table
## <IRangesList> <character> <logical> <character>
## 1 2972-3853 archaetidylserine de.. TRUE 11
## 2 4525-5064 oligoribonuclease TRUE 11
## 3 5220-5295 tRNA-Gly FALSE NA
## 4 5877-6365 tRNA (adenosine(37)-.. TRUE 11
## 5 6392-7630 N-acetylmuramoyl-L-a.. TRUE 11
## 6 7890-8804 tRNA (adenosine(37)-.. TRUE 11
## Contig Dbxref Notes Phase
## <character> <CharacterList> <CharacterList> <IntegerList>
## 1 NZ_CP097750.1 GenBank:WP_250247045.1 0
## 2 NZ_CP097750.1 GenBank:WP_250247043.1 0
## 3 NZ_CP097750.1 0
## 4 NZ_CP097750.1 GenBank:WP_250247041.1 0
## 5 NZ_CP097750.1 GenBank:WP_250248113.1 0
## 6 NZ_CP097750.1 GenBank:WP_250247037.1 0
# in an effort to be space conscious, not all original gffs are kept within this package
genecalls <- get(data("genecalls", package = "SynExtend"))SynExtend’s gffToDataFrame function will
directly import gff files into a usable format, and includes other
extracted information.
## DataFrame with 6 rows and 14 columns
## Index Strand Start Stop Type ID
## <integer> <integer> <integer> <integer> <character> <character>
## 1 1 1 2972 3853 gene gene-M9404_RS00010
## 2 1 0 4525 5064 gene gene-M9404_RS00015
## 3 1 0 5220 5295 gene gene-M9404_RS00020
## 4 1 0 5877 6365 gene gene-M9404_RS00025
## 5 1 0 6392 7630 gene gene-M9404_RS00030
## 6 1 0 7890 8804 gene gene-M9404_RS00035
## Range Product Coding Translation_Table
## <IRangesList> <character> <logical> <character>
## 1 2972-3853 archaetidylserine de.. TRUE 11
## 2 4525-5064 oligoribonuclease TRUE 11
## 3 5220-5295 tRNA-Gly FALSE NA
## 4 5877-6365 tRNA (adenosine(37)-.. TRUE 11
## 5 6392-7630 N-acetylmuramoyl-L-a.. TRUE 11
## 6 7890-8804 tRNA (adenosine(37)-.. TRUE 11
## Contig Dbxref Notes Phase
## <character> <CharacterList> <CharacterList> <IntegerList>
## 1 NZ_CP097750.1 GenBank:WP_250247045.1 0
## 2 NZ_CP097750.1 GenBank:WP_250247043.1 0
## 3 NZ_CP097750.1 0
## 4 NZ_CP097750.1 GenBank:WP_250247041.1 0
## 5 NZ_CP097750.1 GenBank:WP_250248113.1 0
## 6 NZ_CP097750.1 GenBank:WP_250247037.1 0
SynExtend’s primary functions provide a way to identify
where pairs of genes are explicitly linked by syntenic hits, and then
summarize those links. The first step is just identifying those links.
We’re appending some funny business before generating the links because
the example database that was loaded above is shipped without
amino acid sequences to be conscious about package size.
tmp_db <- tempfile()
system(command = paste("cp",
file01,
tmp_db))
drv <- dbDriver("SQLite")
conn01 <- dbConnect(drv = drv,
tmp_db)
l01 <- NucleotideOverlap(SyntenyObject = syn,
GeneCalls = genecalls,
Verbose = TRUE)##
## Reconciling genecalls.
## ================================================================================
## Finding connected features.
## ================================================================================
## Time difference of 0.2642477 secs
The l01 object generated by NucleotideOverlap is a raw
representation of positions on the synteny map where shared k-mers link
genes between paired genomes. As such, it is analagous in shape to
objects of class Synteny. This raw object is unlikely to be
useful to most users, but has been left exposed to ensure that this data
remains accessible should it be advantageous.
## QueryGene SubjectGene ExactOverlap QueryIndex SubjectIndex QLeftPos
## [1,] 24 105 332 1 1 27166
## [2,] 25 185 287 1 1 28116
## [3,] 27 186 1343 1 1 29879
## [4,] 33 189 844 1 1 35460
## [5,] 34 190 1007 1 1 37977
## [6,] 38 136 1131 1 1 42467
## QRightPos SLeftPos SRightPos MaxKmerSize TotalKmerHits
## [1,] 27497 113974 114305 332 1
## [2,] 28402 223609 223895 287 1
## [3,] 31221 224340 225700 767 3
## [4,] 36319 226993 227852 476 2
## [5,] 39466 228899 230364 353 4
## [6,] 43626 155470 156635 320 4
## QueryGene SubjectGene ExactOverlap QueryIndex SubjectIndex QLeftPos
## [1,] 24 105 332 1 1 27166
## [2,] 25 185 287 1 1 28116
## [3,] 27 186 150 1 1 29879
## [4,] 27 186 426 1 1 30029
## [5,] 27 186 767 1 1 30455
## [6,] 33 189 368 1 1 35460
## QRightPos SLeftPos SRightPos QuerySubKey SubjectSubKey SyntenicOrigin
## [1,] 27497 113974 114305 1 1 154
## [2,] 28402 223609 223895 1 1 159
## [3,] 30028 224340 224489 1 1 81
## [4,] 30454 224502 224927 1 1 82
## [5,] 31221 224934 225700 1 1 83
## [6,] 35827 226993 227360 1 1 122
This raw data can be processed to provide a straightforward summary of candidate pairs.
## Collecting pairs.
## ================================================================================
## Time difference of 4.898498 secs
The object p01 is a data.frame where each row is
populated by information about a candidate orthologous pair.
SummarizePairs is just summarizing the pairs identified by
NucleotideOverlap, and offers no direct assignment of
accuracy or truthiness. We can pass the object created by it to
EvaluatePairs to get a sense of which candidates are
acceptable to keep. EvaluatePairs has an un-intuitive mix
of functionalities, but in its most basic form, it will generate a decoy
data set is FALSE pairs, and drop initial candidates that
appear below a False Discovery Rate threshold
p02 <- EvaluatePairs(InputPairs = p01,
DataBase01 = conn01,
EvaluationMethod = "none",
FDRCriteria = c("Delta_Background" = 0.001)) # this is the default behaviorSynExtend also has a variety of other helper functions and a suite of other tools. This is the most basic usage of the tools present here.
Session Info:
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DBI_1.3.0 SynExtend_1.25.1 DECIPHER_3.9.0
## [4] Biostrings_2.81.2 Seqinfo_1.3.0 XVector_0.53.0
## [7] IRanges_2.47.1 S4Vectors_0.51.2 BiocGenerics_0.59.3
## [10] generics_0.1.4 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.43.0 rjson_0.2.23
## [3] xfun_0.57 bslib_0.11.0
## [5] Biobase_2.73.1 lattice_0.22-9
## [7] vctrs_0.7.3 tools_4.6.0
## [9] bitops_1.0-9 curl_7.1.0
## [11] parallel_4.6.0 RSQLite_3.53.1
## [13] blob_1.3.0 pkgconfig_2.0.3
## [15] BiocBaseUtils_1.15.1 Matrix_1.7-5
## [17] cigarillo_1.3.0 lifecycle_1.0.5
## [19] compiler_4.6.0 Rsamtools_2.29.0
## [21] codetools_0.2-20 htmltools_0.5.9
## [23] sys_3.4.3 buildtools_1.0.0
## [25] sass_0.4.10 RCurl_1.98-1.18
## [27] yaml_2.3.12 crayon_1.5.3
## [29] jquerylib_0.1.4 BiocParallel_1.47.0
## [31] cachem_1.1.0 DelayedArray_0.39.2
## [33] abind_1.4-8 digest_0.6.39
## [35] restfulr_0.0.16 maketools_1.3.2
## [37] fastmap_1.2.0 grid_4.6.0
## [39] SparseArray_1.13.2 cli_3.6.6
## [41] S4Arrays_1.13.0 XML_3.99-0.23
## [43] bit64_4.8.2 rmarkdown_2.31
## [45] httr_1.4.8 matrixStats_1.5.0
## [47] bit_4.6.0 memoise_2.0.1
## [49] evaluate_1.0.5 knitr_1.51
## [51] GenomicRanges_1.65.0 BiocIO_1.23.3
## [53] rtracklayer_1.73.0 rlang_1.2.0
## [55] BiocManager_1.30.27 jsonlite_2.0.0
## [57] R6_2.6.1 MatrixGenerics_1.25.0
## [59] GenomicAlignments_1.49.0