CNEr identifies conserved noncoding elments (CNEs) from pairwise whole genome alignment (net.axt files) of two species. UCSC has provided alignments between many species on the downloads, hence it is highly recommended to use their alignments when available. When the alignments of some new assemblies/species are not availble from UCSC yet, this vignette describes the pipeline of generating the alignments merely from soft-masked 2bit files or fasta files. This vignette is based on the genomewiki from UCSC.
Note:
List of external softwares have to be installed on the machine: * The sequence alignment program LASTZ * Kent Utilities. In this pipeline, lavToPsl, axtChain, chainMergeSort, chainPreNet, chainNet, netSyntenic, netToAxt, axtSort are essential. netClass is optional.
Here, as an example, we will get the pairwise alignment only on “chr1”, “chr2” and “chr3” between zebrafish(danRer10) and human(hg38).
First we need to download the 2bit files from UCSC, and set
the appropriate paths of assemblyTarget
,
assemblyQuery
and intermediate files. Then we can run
lastz
function to generate the lav files.
## lastz aligner
assemblyDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/2bit"
axtDir <- "/Users/gtan/OneDrive/Project/CSC/CNEr/axt"
assemblyTarget <- file.path(system.file("extdata",
package="BSgenome.Drerio.UCSC.danRer10"),
"single_sequences.2bit")
assemblyQuery <- file.path(system.file("extdata",
package="BSgenome.Hsapiens.UCSC.hg38"),
"single_sequences.2bit")
lavs <- lastz(assemblyTarget, assemblyQuery,
outputDir=axtDir,
chrsTarget=c("chr1", "chr2", "chr3"),
chrsQuery=c("chr1", "chr2", "chr3"),
distance="far", mc.cores=4)
## lav files to psl files conversion
psls <- lavToPsl(lavs, removeLav=FALSE, binary="lavToPsl")
One essential argument here is the distance
. It
determines the scoring matrix to use in lastz
aligner. See
?scoringMatrix
for more details.
Note: This step may encounter difficulties if the
two assemblies are too fragmented, because there can be millions of
combinations of the chromosomes/scaffolds. The lastz
is
overwhelmed with reading small pieces from assembly for each
combination, rather than doing actual alignment. In this case, another
aligner last is recommended and
introduced in nex section.
last
aligner is considered faster and memory efficient.
It creates maf file, which can converted to psl files.
Then the same following processes can be used on psl files.
Different from lastz
, last
aligner starts
with fasta files. The target genome sequence has to build the
index file first, and then align with the query genome
sequence.
## Build the lastdb index
system2(command="lastdb", args=c("-c", file.path(assemblyDir, "danRer10"),
file.path(assemblyDir, "danRer10.fa")))
## Run last aligner
lastal(db=file.path(assemblyDir, "danRer10"),
queryFn=file.path(assemblyDir, "hg38.fa"),
outputFn=file.path(axtDir, "danRer10.hg38.maf"),
distance="far", binary="lastal", mc.cores=4L)
## maf to psl
psls <- file.path(axtDir, "danRer10.hg38.psl")
system2(command="maf-convert", args=c("psl",
file.path(axtDir, "danRer10.hg38.maf"),
">", psls))
Another alternative of alignment software is YASS. It may be added into this pipeline after we test the performance.
If two matching alignments next to each other are close enough, they are joined into one fragment. Then these chain files are sorted and combined into one big file.
## Join close alignments
chains <- axtChain(psls, assemblyTarget=assemblyTarget,
assemblyQuery=assemblyQuery, distance="far",
removePsl=FALSE, binary="axtChain")
## Sort and combine
allChain <- chainMergeSort(chains, assemblyTarget, assemblyQuery,
allChain=file.path(axtDir,
paste0(sub("\\.2bit$", "", basename(assemblyTarget),
ignore.case=TRUE), ".",
sub("\\.2bit$", "", basename(assemblyQuery),
ignore.case=TRUE), ".all.chain")),
removeChains=FALSE, binary="chainMergeSort")
In this step, first we filter out chains that are unlikely to be
netted by chainPreNet
. During the alignment, every genomic
fragment can match with several others, and certainly we want to keep
the best one. This is done by chainNet
. Then we add the
synteny information with netSyntenic
.
## Filtering out chains
allPreChain <- chainPreNet(allChain, assemblyTarget, assemblyQuery,
allPreChain=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".all.pre.chain")),
removeAllChain=FALSE, binary="chainPreNet")
## Keep the best chain and add synteny information
netSyntenicFile <- chainNetSyntenic(allPreChain, assemblyTarget, assemblyQuery,
netSyntenicFile=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".noClass.net")),
binaryChainNet="chainNet", binaryNetSyntenic="netSyntenic")
As the last step, we create the .net.axt file from the previous net and chain files.
netToAxt(netSyntenicFile, allPreChain, assemblyTarget, assemblyQuery,
axtFile=file.path(axtDir,
paste0(sub("\\.2bit$", "",
basename(assemblyTarget),
ignore.case = TRUE), ".",
sub("\\.2bit$", "",
basename(assemblyQuery),
ignore.case = TRUE),
".net.axt")),
removeFiles=FALSE,
binaryNetToAxt="netToAxt", binaryAxtSort="axtSort")
Here is the output of sessionInfo()
on the system on
which this document was compiled:
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Gviz_1.51.0 BSgenome.Ggallus.UCSC.galGal3_1.4.0
## [3] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.75.0
## [5] rtracklayer_1.67.0 BiocIO_1.17.0
## [7] Biostrings_2.75.0 XVector_0.47.0
## [9] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
## [11] IRanges_2.41.0 S4Vectors_0.45.0
## [13] BiocGenerics_0.53.1 generics_0.1.3
## [15] CNEr_1.43.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.17.1
## [3] sys_3.4.3 jsonlite_1.8.9
## [5] magrittr_2.0.3 GenomicFeatures_1.59.0
## [7] farver_2.1.2 rmarkdown_2.28
## [9] zlibbioc_1.52.0 vctrs_0.6.5
## [11] memoise_2.0.1 Rsamtools_2.23.0
## [13] RCurl_1.98-1.16 base64enc_0.1-3
## [15] htmltools_0.5.8.1 S4Arrays_1.7.1
## [17] progress_1.2.3 curl_5.2.3
## [19] SparseArray_1.7.0 Formula_1.2-5
## [21] sass_0.4.9 pracma_2.4.4
## [23] bslib_0.8.0 htmlwidgets_1.6.4
## [25] plyr_1.8.9 httr2_1.0.5
## [27] cachem_1.1.0 buildtools_1.0.0
## [29] GenomicAlignments_1.43.0 lifecycle_1.0.4
## [31] pkgconfig_2.0.3 Matrix_1.7-1
## [33] R6_2.5.1 fastmap_1.2.0
## [35] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
## [37] digest_0.6.37 colorspace_2.1-1
## [39] AnnotationDbi_1.69.0 Hmisc_5.2-0
## [41] RSQLite_2.3.7 filelock_1.0.3
## [43] labeling_0.4.3 fansi_1.0.6
## [45] httr_1.4.7 abind_1.4-8
## [47] compiler_4.4.1 bit64_4.5.2
## [49] withr_3.0.2 backports_1.5.0
## [51] htmlTable_2.4.3 BiocParallel_1.41.0
## [53] DBI_1.2.3 highr_0.11
## [55] R.utils_2.12.3 biomaRt_2.63.0
## [57] rappdirs_0.3.3 poweRlaw_0.80.0
## [59] DelayedArray_0.33.1 rjson_0.2.23
## [61] tools_4.4.1 foreign_0.8-87
## [63] nnet_7.3-19 R.oo_1.27.0
## [65] glue_1.8.0 restfulr_0.0.15
## [67] checkmate_2.3.2 cluster_2.1.6
## [69] reshape2_1.4.4 gtable_0.3.6
## [71] tzdb_0.4.0 R.methodsS3_1.8.2
## [73] ensembldb_2.31.0 data.table_1.16.2
## [75] hms_1.1.3 xml2_1.3.6
## [77] utf8_1.2.4 pillar_1.9.0
## [79] stringr_1.5.1 vroom_1.6.5
## [81] dplyr_1.1.4 BiocFileCache_2.15.0
## [83] lattice_0.22-6 deldir_2.0-4
## [85] bit_4.5.0 annotate_1.85.0
## [87] biovizBase_1.55.0 tidyselect_1.2.1
## [89] GO.db_3.20.0 maketools_1.3.1
## [91] knitr_1.48 gridExtra_2.3
## [93] ProtGenerics_1.39.0 SummarizedExperiment_1.37.0
## [95] xfun_0.49 Biobase_2.67.0
## [97] matrixStats_1.4.1 stringi_1.8.4
## [99] UCSC.utils_1.3.0 lazyeval_0.2.2
## [101] yaml_2.3.10 evaluate_1.0.1
## [103] codetools_0.2-20 interp_1.1-6
## [105] tibble_3.2.1 BiocManager_1.30.25
## [107] cli_3.6.3 rpart_4.1.23
## [109] xtable_1.8-4 munsell_0.5.1
## [111] jquerylib_0.1.4 dichromat_2.0-0.1
## [113] Rcpp_1.0.13-1 dbplyr_2.5.0
## [115] png_0.1-8 XML_3.99-0.17
## [117] parallel_4.4.1 ggplot2_3.5.1
## [119] readr_2.1.5 blob_1.2.4
## [121] prettyunits_1.2.0 jpeg_0.1-10
## [123] latticeExtra_0.6-30 AnnotationFilter_1.31.0
## [125] bitops_1.0-9 pwalign_1.3.0
## [127] VariantAnnotation_1.53.0 scales_1.3.0
## [129] crayon_1.5.3 rlang_1.1.4
## [131] KEGGREST_1.47.0