Assessing synteny identification

Introduction

Synteny analysis identifies conserved gene content and gene order (collinearity) in a genomic segment, and it is often used to study how genomic rearrangements have shaped genomes over the course of evolution. However, accurate detection of syntenic blocks depends heavily on parameters such as the minimum number of anchors and the maximum number of upstream and downstream genes to search when defining a syntenic block. Zhao and Schranz (2019) proposed a network-based synteny analysis (the algorithm is now implemented in the Bioconductor package syntenet) that identifies optimal parameters using the network’s average clustering coefficient and number of nodes. Here, we slightly modified the approach to also take into account how well the network’s degree distribution fits a scale-free topology, which is a typical property of biological networks. This method allows users to identify the best combination of parameters for synteny detection and synteny network inference.

Installation

To install the package from Bioconductor, use the following code:

if(!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install("cogeqc")

Then, load the package:

# Load package after installation
library(cogeqc)
set.seed(123) # for reproducibility

Data description

Here, we will use a subset of the synteny network inferred in Zhao and Schranz (2019) that covers Brassica oleracea, B. napus, and B. rapa.

# Load the example synteny network
data(synnet)

head(synnet)
#>             anchor1        anchor2
#> 1 bnp_BnaA01g05780D bol_Bo1g011310
#> 2 bnp_BnaA01g05800D bol_Bo1g011320
#> 3 bnp_BnaA01g05810D bol_Bo1g011330
#> 4 bnp_BnaA01g05820D bol_Bo1g011340
#> 5 bnp_BnaA01g05830D bol_Bo1g011350
#> 6 bnp_BnaA01g05840D bol_Bo1g011360

Network-based assessment of synteny identification

To assess synteny detection, we calculate a synteny network score as follows:

$$ \begin{aligned} Score &= C N R^2_{SFT} \end{aligned} $$

where $C$ is the network’s average clustering coefficient, $N$ is the number of nodes, and $R^2_{SFT}$ is the coefficient of determination for the scale-free topology fit.
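To make the three terms concrete, here is a minimal sketch of how they could be computed for any two-column edge list with igraph. The helper `score_edgelist()` is hypothetical and is not part of cogeqc. In particular, the scale-free fit used here (a linear regression of log-frequency on log-degree) is an assumption, and the package may estimate it differently.

```r
library(igraph)

# Hypothetical helper (not part of cogeqc): score a two-column edge list
# as Score = C * N * R^2.
score_edgelist <- function(edges) {
    g <- graph_from_data_frame(edges, directed = FALSE)

    # C: average of the local clustering coefficients
    cc <- transitivity(g, type = "average")

    # N: number of nodes
    n <- vcount(g)

    # R^2: fit of log10 P(k) ~ log10 k, where P(k) is the fraction of
    # nodes with degree k (an assumed, simple measure of scale-free fit)
    k_counts <- table(degree(g))
    k <- as.numeric(names(k_counts))
    p_k <- as.numeric(k_counts) / sum(k_counts)
    r2 <- summary(lm(log10(p_k) ~ log10(k)))$r.squared

    data.frame(CC = cc, Node_count = n, Rsquared = r2, Score = cc * n * r2)
}

# Toy network from a preferential-attachment model (illustrative only)
set.seed(123)
toy_graph <- sample_pa(500, m = 2, directed = FALSE)
toy_edges <- igraph::as_data_frame(toy_graph, what = "edges")
toy_score <- score_edgelist(toy_edges)
toy_score
```

Because the three terms are multiplied, a network scores highly only if it is large, well clustered, and close to scale-free at the same time.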

The network with the highest score is considered the most accurate. To score a network, you will use the function assess_synnet().

assess_synnet(synnet)
#>         CC Node_count  Rsquared    Score
#> 1 0.877912     149144 0.6806854 89125.76
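
As a quick check of the formula, the reported score can be reproduced by multiplying the three printed components. The values below are copied from the output above. Since they are rounded for printing, the product should match the reported score only up to a small rounding error.

```r
# Components printed by assess_synnet() for the example network
cc <- 0.877912      # clustering coefficient (CC)
n <- 149144         # number of nodes (Node_count)
r2 <- 0.6806854     # scale-free topology fit (Rsquared)

# Score = C * N * R^2
score <- cc * n * r2

# Compare with the reported score, allowing for rounding of the inputs
reported <- 89125.76
abs(score - reported) < 0.5
```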

Ideally, you should infer synteny networks using syntenet with multiple combinations of parameters and assess each network to pick the best. To demonstrate this, let’s simulate different networks by randomly removing edges from the original network, and then calculate scores for each of them with the wrapper function assess_synnet_list().

# Simulate networks
net1 <- synnet
net2 <- synnet[-sample(1:10000, 500), ]
net3 <- synnet[-sample(1:10000, 1000), ]
synnet_list <- list(
  net1 = net1, 
  net2 = net2, 
  net3 = net3
)

# Assess original network + 2 simulations
synnet_assesment <- assess_synnet_list(synnet_list)
synnet_assesment
#>          CC Node_count  Rsquared    Score Network
#> 1 0.8779120     149144 0.6806854 89125.76    net1
#> 2 0.8769428     149133 0.6813367 89105.97    net2
#> 3 0.8758974     149114 0.6810978 88957.20    net3

# Determine the best network
synnet_assesment$Network[which.max(synnet_assesment$Score)]
#> [1] "net1"
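
Beyond picking the single top network, it can help to see how far the others fall behind. The sketch below rebuilds the assessment table from the values printed above, ranks it by score, and adds a column (`Pct_below_best`, a name introduced here for illustration) with the relative gap to the best score.

```r
# Assessment table rebuilt from the printed output above
assessment <- data.frame(
    CC = c(0.8779120, 0.8769428, 0.8758974),
    Node_count = c(149144, 149133, 149114),
    Rsquared = c(0.6806854, 0.6813367, 0.6810978),
    Score = c(89125.76, 89105.97, 88957.20),
    Network = c("net1", "net2", "net3")
)

# Rank networks from highest to lowest score
ranked <- assessment[order(assessment$Score, decreasing = TRUE), ]

# Relative gap (in percent) between each network and the best one
best <- max(ranked$Score)
ranked$Pct_below_best <- 100 * (best - ranked$Score) / best
ranked[, c("Network", "Score", "Pct_below_best")]
```

In this simulation the gaps are well below 1%, which is expected because the simulated networks differ from the original by only a few hundred edges.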

As you can see, the first (original) network is the best, as it has the highest score.
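
In a real analysis, the candidate networks would come from different parameter combinations rather than from resampling. A minimal sketch of such a sweep is shown below. The parameter names `anchors` and `max_gaps`, the helper `sweep_networks()`, and the stand-in `toy_infer()` are assumptions made for illustration. In practice, `infer_fun` would wrap whatever function you use to infer a synteny network (for example, from syntenet), and its argument names should be checked against that package's documentation.

```r
# Hypothetical parameter grid (names and values are assumptions)
param_grid <- expand.grid(anchors = c(3, 5, 7), max_gaps = c(15, 25))

# Hypothetical helper: infer one network per row of the grid.
# `infer_fun` must accept `anchors` and `max_gaps` and return an edge list.
sweep_networks <- function(grid, infer_fun) {
    nets <- lapply(seq_len(nrow(grid)), function(i) {
        infer_fun(anchors = grid$anchors[i], max_gaps = grid$max_gaps[i])
    })
    names(nets) <- paste0("anchors", grid$anchors, "_gaps", grid$max_gaps)
    nets
}

# Dummy stand-in so the sketch runs without genome data;
# it returns a tiny placeholder edge list, not a real synteny network.
toy_infer <- function(anchors, max_gaps) {
    data.frame(
        anchor1 = paste0("a", seq_len(anchors)),
        anchor2 = paste0("b", seq_len(anchors))
    )
}

candidate_nets <- sweep_networks(param_grid, toy_infer)
names(candidate_nets)

# With real networks, the named list could then be scored with:
# assess_synnet_list(candidate_nets)
```

Naming each list element after its parameter combination means the `Network` column of the assessment table points directly back to the settings that produced the best score.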

Session information

This document was created under the following conditions:

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       Ubuntu 24.04.1 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2024-11-19
#>  pandoc   3.2.1 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package          * version  date (UTC) lib source
#>  ape                5.8      2024-04-11 [2] RSPM (R 4.4.0)
#>  aplot              0.2.3    2024-06-17 [2] RSPM (R 4.4.0)
#>  beeswarm           0.4.0    2021-06-01 [2] RSPM (R 4.4.0)
#>  BiocGenerics       0.53.3   2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  BiocManager        1.30.25  2024-08-28 [2] RSPM (R 4.4.0)
#>  BiocStyle        * 2.35.0   2024-11-19 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  Biostrings         2.75.1   2024-11-07 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  bslib              0.8.0    2024-07-29 [2] RSPM (R 4.4.0)
#>  buildtools         1.0.0    2024-11-18 [3] local (/pkg)
#>  cachem             1.1.0    2024-05-16 [2] RSPM (R 4.4.0)
#>  cli                3.6.3    2024-06-21 [2] RSPM (R 4.4.0)
#>  cogeqc           * 1.11.0   2024-11-19 [1] https://bioc.r-universe.dev (R 4.4.2)
#>  colorspace         2.1-1    2024-07-26 [2] RSPM (R 4.4.0)
#>  crayon             1.5.3    2024-06-20 [2] RSPM (R 4.4.0)
#>  digest             0.6.37   2024-08-19 [2] RSPM (R 4.4.0)
#>  dplyr              1.1.4    2023-11-17 [2] RSPM (R 4.4.0)
#>  evaluate           1.0.1    2024-10-10 [2] RSPM (R 4.4.0)
#>  fansi              1.0.6    2023-12-08 [2] RSPM (R 4.4.0)
#>  farver             2.1.2    2024-05-13 [2] RSPM (R 4.4.0)
#>  fastmap            1.2.0    2024-05-15 [2] RSPM (R 4.4.0)
#>  fs                 1.6.5    2024-10-30 [2] RSPM (R 4.4.0)
#>  generics           0.1.3    2022-07-05 [2] RSPM (R 4.4.0)
#>  GenomeInfoDb       1.43.1   2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  GenomeInfoDbData   1.2.13   2024-11-19 [2] Bioconductor
#>  ggbeeswarm         0.7.2    2023-04-29 [2] RSPM (R 4.4.0)
#>  ggfun              0.1.7    2024-10-24 [2] RSPM (R 4.4.0)
#>  ggplot2            3.5.1    2024-04-23 [2] RSPM (R 4.4.0)
#>  ggplotify          0.1.2    2023-08-09 [2] RSPM (R 4.4.0)
#>  ggtree             3.15.0   2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  glue               1.8.0    2024-09-30 [2] RSPM (R 4.4.0)
#>  gridGraphics       0.5-1    2020-12-13 [2] RSPM (R 4.4.0)
#>  gtable             0.3.6    2024-10-25 [2] RSPM (R 4.4.0)
#>  htmltools          0.5.8.1  2024-04-04 [2] RSPM (R 4.4.0)
#>  httr               1.4.7    2023-08-15 [2] RSPM (R 4.4.0)
#>  igraph             2.1.1    2024-10-19 [2] RSPM (R 4.4.0)
#>  IRanges            2.41.1   2024-11-17 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  jquerylib          0.1.4    2021-04-26 [2] RSPM (R 4.4.0)
#>  jsonlite           1.8.9    2024-09-20 [2] RSPM (R 4.4.0)
#>  knitr              1.49     2024-11-08 [2] RSPM (R 4.4.0)
#>  labeling           0.4.3    2023-08-29 [2] RSPM (R 4.4.0)
#>  lattice            0.22-6   2024-03-20 [2] RSPM (R 4.4.0)
#>  lazyeval           0.2.2    2019-03-15 [2] RSPM (R 4.4.0)
#>  lifecycle          1.0.4    2023-11-07 [2] RSPM (R 4.4.0)
#>  magrittr           2.0.3    2022-03-30 [2] RSPM (R 4.4.0)
#>  maketools          1.3.1    2024-10-04 [3] RSPM (R 4.4.0)
#>  munsell            0.5.1    2024-04-01 [2] RSPM (R 4.4.0)
#>  nlme               3.1-166  2024-08-14 [2] RSPM (R 4.4.0)
#>  patchwork          1.3.0    2024-09-16 [2] RSPM (R 4.4.0)
#>  pillar             1.9.0    2023-03-22 [2] RSPM (R 4.4.0)
#>  pkgconfig          2.0.3    2019-09-22 [2] RSPM (R 4.4.0)
#>  plyr               1.8.9    2023-10-02 [2] RSPM (R 4.4.0)
#>  purrr              1.0.2    2023-08-10 [2] RSPM (R 4.4.0)
#>  R6                 2.5.1    2021-08-19 [2] RSPM (R 4.4.0)
#>  Rcpp               1.0.13-1 2024-11-02 [2] RSPM (R 4.4.0)
#>  reshape2           1.4.4    2020-04-09 [2] RSPM (R 4.4.0)
#>  rlang              1.1.4    2024-06-04 [2] RSPM (R 4.4.0)
#>  rmarkdown          2.29     2024-11-04 [2] RSPM (R 4.4.0)
#>  S4Vectors          0.45.2   2024-11-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#>  sass               0.4.9    2024-03-15 [2] RSPM (R 4.4.0)
#>  scales             1.3.0    2023-11-28 [2] RSPM (R 4.4.0)
#>  sessioninfo        1.2.2    2021-12-06 [2] RSPM (R 4.4.0)
#>  stringi            1.8.4    2024-05-06 [2] RSPM (R 4.4.0)
#>  stringr            1.5.1    2023-11-14 [2] RSPM (R 4.4.0)
#>  sys                3.4.3    2024-10-04 [2] RSPM (R 4.4.0)
#>  tibble             3.2.1    2023-03-20 [2] RSPM (R 4.4.0)
#>  tidyr              1.3.1    2024-01-24 [2] RSPM (R 4.4.0)
#>  tidyselect         1.2.1    2024-03-11 [2] RSPM (R 4.4.0)
#>  tidytree           0.4.6    2023-12-12 [2] RSPM (R 4.4.0)
#>  treeio             1.31.0   2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  UCSC.utils         1.3.0    2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  utf8               1.2.4    2023-10-22 [2] RSPM (R 4.4.0)
#>  vctrs              0.6.5    2023-12-01 [2] RSPM (R 4.4.0)
#>  vipor              0.4.7    2023-12-18 [2] RSPM (R 4.4.0)
#>  withr              3.0.2    2024-10-28 [2] RSPM (R 4.4.0)
#>  xfun               0.49     2024-10-31 [2] RSPM (R 4.4.0)
#>  XVector            0.47.0   2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#>  yaml               2.3.10   2024-07-26 [2] RSPM (R 4.4.0)
#>  yulab.utils        0.1.8    2024-11-07 [2] RSPM (R 4.4.0)
#>  zlibbioc           1.52.0   2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#> 
#>  [1] /tmp/Rtmpa1SeZt/Rinst15362d3293f0
#>  [2] /github/workspace/pkglib
#>  [3] /usr/local/lib/R/site-library
#>  [4] /usr/lib/R/site-library
#>  [5] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

References

Zhao, Tao, and M Eric Schranz. 2019. “Network-Based Microsynteny Analysis Identifies Major Differences and Genomic Outliers in Mammalian and Angiosperm Genomes.” Proceedings of the National Academy of Sciences 116 (6): 2165–74.