Pairwise whole genome alignment

Introduction

CNEr identifies conserved noncoding elments (CNEs) from pairwise whole genome alignment (net.axt files) of two species. UCSC has provided alignments between many species on the downloads, hence it is highly recommended to use their alignments when available. When the alignments of some new assemblies/species are not availble from UCSC yet, this vignette describes the pipeline of generating the alignments merely from soft-masked 2bit files or fasta files. This vignette is based on the genomewiki from UCSC.

Note:

  • The whole pipeline relies on Kent’s utilities. Since Kent’s utilities don’t support Windows platform, this pipeline also only works on Linux and Unix platform.
  • Many of the functions in this pipeline are just wrapper functions, in order to simplfy the settings and work consistently with the rest of CNEs pipeline within R environment.
  • Due to large file size of the genome assemblies, the commands in this vignette don’t run during the vignette compilation. Users have to set the correct paths on their machine in order to make these commands work.

Prerequisite

List of external softwares have to be installed on the machine: * The sequence alignment program LASTZ * Kent Utilities. In this pipeline, lavToPsl, axtChain, chainMergeSort, chainPreNet, chainNet, netSyntenic, netToAxt, axtSort are essential. netClass is optional.

Aligning

Here, as an example, we will get the pairwise alignment only on “chr1”, “chr2” and “chr3” between zebrafish(danRer10) and human(hg38).

LASTZ aligner

First we need to download the 2bit files from UCSC, and set the appropriate paths of assemblyTarget, assemblyQuery and intermediate files. Then we can run lastz function to generate the lav files.

## lastz aligner
assemblyDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/2bit"
axtDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/axt"
assemblyTarget <- file.path(system.file("extdata",
                            package="BSgenome.Drerio.UCSC.danRer10"),
                            "single_sequences.2bit")
assemblyQuery <- file.path(system.file("extdata",
                                       package="BSgenome.Hsapiens.UCSC.hg38"),
                           "single_sequences.2bit")
lavs <- lastz(assemblyTarget, assemblyQuery, 
              outputDir=axtDir,
              chrsTarget=c("chr1", "chr2", "chr3"),
              chrsQuery=c("chr1", "chr2", "chr3"),
              distance="far", mc.cores=4)

## lav files to psl files conversion
psls <- lavToPsl(lavs, removeLav=FALSE, binary="lavToPsl")

One essential argument here is the distance. It determines the scoring matrix to use in lastz aligner. See ?scoringMatrix for more details.

Note: This step may encounter difficulties if the two assemblies are too fragmented, because there can be millions of combinations of the chromosomes/scaffolds. The lastz is overwhelmed with reading small pieces from assembly for each combination, rather than doing actual alignment. In this case, another aligner last is recommended and introduced in nex section.

last aligner

last aligner is considered faster and memory efficient. It creates maf file, which can converted to psl files. Then the same following processes can be used on psl files.

Different from lastz, last aligner starts with fasta files. The target genome sequence has to build the index file first, and then align with the query genome sequence.

## Build the lastdb index
system2(command="lastdb", args=c("-c", file.path(assemblyDir, "danRer10"),
                                 file.path(assemblyDir, "danRer10.fa")))

## Run last aligner
lastal(db=file.path(assemblyDir, "danRer10"),
       queryFn=file.path(assemblyDir, "hg38.fa"),
       outputFn=file.path(axtDir, "danRer10.hg38.maf"),
       distance="far", binary="lastal", mc.cores=4L)

## maf to psl 
psls <- file.path(axtDir, "danRer10.hg38.psl")
system2(command="maf-convert", args=c("psl", 
                                      file.path(axtDir, "danRer10.hg38.maf"),
                                      ">", psls))

YASS aligner

Another alternative of alignment software is YASS. It may be added into this pipeline after we test the performance.

Chaining:

If two matching alignments next to each other are close enough, they are joined into one fragment. Then these chain files are sorted and combined into one big file.

## Join close alignments
chains <- axtChain(psls, assemblyTarget=assemblyTarget,
                   assemblyQuery=assemblyQuery, distance="far",
                   removePsl=FALSE, binary="axtChain")

## Sort and combine
allChain <- chainMergeSort(chains, assemblyTarget, assemblyQuery,
              allChain=file.path(axtDir,
                         paste0(sub("\\.2bit$", "", basename(assemblyTarget),
                                    ignore.case=TRUE), ".", 
                                sub("\\.2bit$", "", basename(assemblyQuery), 
                                    ignore.case=TRUE), ".all.chain")),
                           removeChains=FALSE, binary="chainMergeSort")

Netting:

In this step, first we filter out chains that are unlikely to be netted by chainPreNet. During the alignment, every genomic fragment can match with several others, and certainly we want to keep the best one. This is done by chainNet. Then we add the synteny information with netSyntenic.

## Filtering out chains
allPreChain <- chainPreNet(allChain, assemblyTarget, assemblyQuery,
                           allPreChain=file.path(axtDir,
                                      paste0(sub("\\.2bit$", "", 
                                                 basename(assemblyTarget),
                                                 ignore.case = TRUE), ".", 
                                             sub("\\.2bit$", "",
                                                 basename(assemblyQuery),
                                                 ignore.case = TRUE),
                                                 ".all.pre.chain")),
                           removeAllChain=FALSE, binary="chainPreNet")

## Keep the best chain and add synteny information
netSyntenicFile <- chainNetSyntenic(allPreChain, assemblyTarget, assemblyQuery,
                     netSyntenicFile=file.path(axtDir,
                                               paste0(sub("\\.2bit$", "",
                                                      basename(assemblyTarget),
                                                      ignore.case = TRUE), ".",
                                                      sub("\\.2bit$", "",
                                                      basename(assemblyQuery),
                                                      ignore.case = TRUE),
                                               ".noClass.net")),
                     binaryChainNet="chainNet", binaryNetSyntenic="netSyntenic")

axtNet

As the last step, we create the .net.axt file from the previous net and chain files.

netToAxt(netSyntenicFile, allPreChain, assemblyTarget, assemblyQuery,
         axtFile=file.path(axtDir,
                           paste0(sub("\\.2bit$", "",
                                      basename(assemblyTarget),
                                      ignore.case = TRUE), ".",
                                  sub("\\.2bit$", "",
                                      basename(assemblyQuery),
                                      ignore.case = TRUE),
                                  ".net.axt")),
             removeFiles=FALSE,
             binaryNetToAxt="netToAxt", binaryAxtSort="axtSort")

Session info

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Gviz_1.51.0                         BSgenome.Ggallus.UCSC.galGal3_1.4.0
##  [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3   BSgenome_1.75.0                    
##  [5] rtracklayer_1.67.0                  BiocIO_1.17.0                      
##  [7] Biostrings_2.75.0                   XVector_0.47.0                     
##  [9] GenomicRanges_1.59.0                GenomeInfoDb_1.43.0                
## [11] IRanges_2.41.0                      S4Vectors_0.45.0                   
## [13] BiocGenerics_0.53.1                 generics_0.1.3                     
## [15] CNEr_1.43.0                         BiocStyle_2.35.0                   
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
##   [3] sys_3.4.3                   jsonlite_1.8.9             
##   [5] magrittr_2.0.3              GenomicFeatures_1.59.0     
##   [7] farver_2.1.2                rmarkdown_2.28             
##   [9] zlibbioc_1.52.0             vctrs_0.6.5                
##  [11] memoise_2.0.1               Rsamtools_2.23.0           
##  [13] RCurl_1.98-1.16             base64enc_0.1-3            
##  [15] htmltools_0.5.8.1           S4Arrays_1.7.1             
##  [17] progress_1.2.3              curl_5.2.3                 
##  [19] SparseArray_1.7.0           Formula_1.2-5              
##  [21] sass_0.4.9                  pracma_2.4.4               
##  [23] bslib_0.8.0                 htmlwidgets_1.6.4          
##  [25] plyr_1.8.9                  httr2_1.0.5                
##  [27] cachem_1.1.0                buildtools_1.0.0           
##  [29] GenomicAlignments_1.43.0    lifecycle_1.0.4            
##  [31] pkgconfig_2.0.3             Matrix_1.7-1               
##  [33] R6_2.5.1                    fastmap_1.2.0              
##  [35] GenomeInfoDbData_1.2.13     MatrixGenerics_1.19.0      
##  [37] digest_0.6.37               colorspace_2.1-1           
##  [39] AnnotationDbi_1.69.0        Hmisc_5.2-0                
##  [41] RSQLite_2.3.7               filelock_1.0.3             
##  [43] labeling_0.4.3              fansi_1.0.6                
##  [45] httr_1.4.7                  abind_1.4-8                
##  [47] compiler_4.4.1              bit64_4.5.2                
##  [49] withr_3.0.2                 backports_1.5.0            
##  [51] htmlTable_2.4.3             BiocParallel_1.41.0        
##  [53] DBI_1.2.3                   highr_0.11                 
##  [55] R.utils_2.12.3              biomaRt_2.63.0             
##  [57] rappdirs_0.3.3              poweRlaw_0.80.0            
##  [59] DelayedArray_0.33.1         rjson_0.2.23               
##  [61] tools_4.4.1                 foreign_0.8-87             
##  [63] nnet_7.3-19                 R.oo_1.27.0                
##  [65] glue_1.8.0                  restfulr_0.0.15            
##  [67] checkmate_2.3.2             cluster_2.1.6              
##  [69] reshape2_1.4.4              gtable_0.3.6               
##  [71] tzdb_0.4.0                  R.methodsS3_1.8.2          
##  [73] ensembldb_2.31.0            data.table_1.16.2          
##  [75] hms_1.1.3                   xml2_1.3.6                 
##  [77] utf8_1.2.4                  pillar_1.9.0               
##  [79] stringr_1.5.1               vroom_1.6.5                
##  [81] dplyr_1.1.4                 BiocFileCache_2.15.0       
##  [83] lattice_0.22-6              deldir_2.0-4               
##  [85] bit_4.5.0                   annotate_1.85.0            
##  [87] biovizBase_1.55.0           tidyselect_1.2.1           
##  [89] GO.db_3.20.0                maketools_1.3.1            
##  [91] knitr_1.48                  gridExtra_2.3              
##  [93] ProtGenerics_1.39.0         SummarizedExperiment_1.37.0
##  [95] xfun_0.49                   Biobase_2.67.0             
##  [97] matrixStats_1.4.1           stringi_1.8.4              
##  [99] UCSC.utils_1.3.0            lazyeval_0.2.2             
## [101] yaml_2.3.10                 evaluate_1.0.1             
## [103] codetools_0.2-20            interp_1.1-6               
## [105] tibble_3.2.1                BiocManager_1.30.25        
## [107] cli_3.6.3                   rpart_4.1.23               
## [109] xtable_1.8-4                munsell_0.5.1              
## [111] jquerylib_0.1.4             dichromat_2.0-0.1          
## [113] Rcpp_1.0.13-1               dbplyr_2.5.0               
## [115] png_0.1-8                   XML_3.99-0.17              
## [117] parallel_4.4.1              ggplot2_3.5.1              
## [119] readr_2.1.5                 blob_1.2.4                 
## [121] prettyunits_1.2.0           jpeg_0.1-10                
## [123] latticeExtra_0.6-30         AnnotationFilter_1.31.0    
## [125] bitops_1.0-9                pwalign_1.3.0              
## [127] VariantAnnotation_1.53.0    scales_1.3.0               
## [129] crayon_1.5.3                rlang_1.1.4                
## [131] KEGGREST_1.47.0