Intragenic exonic deletions are known to contribute to genetic
diseases and are often flanked by regions of homology. The Exome
Database of Interspersed Repeats (EDIR) was developed to provide an
overview of the positions of repetitive structures within the human
genome composed of interspersed repeats encompassing a coding sequence.
The package EDIRquery provides user-friendly tools to query this database for genes of interest.
EDIR provides a dataset of pairwise repeat structures in which the two repeat sequences are located within 1000 bp of each other and fulfill one of the following selection criteria:
- At least one repeat located in an exon
- Both repeats situated in different introns flanking one or more exons
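The selection criteria can be written as a small predicate. This is only an illustrative sketch, not the code used to build EDIR: the helper name `meets_edir_criteria` is hypothetical, and it assumes location labels of the form `E<n>` (exon) and `I<n>` (intron), as in the `intron_exon1` and `intron_exon2` columns of the query output. It also assumes that two different introns of the same transcript always flank at least one exon.

```r
# Hypothetical helper, not part of EDIRquery: checks whether a repeat pair
# would satisfy the EDIR selection criteria described in the text.
meets_edir_criteria <- function(loc1, loc2, distance, max_dist = 1000) {
  in_exon1 <- startsWith(loc1, "E")
  in_exon2 <- startsWith(loc2, "E")

  # Criterion 1: at least one repeat is located in an exon
  any_exon <- in_exon1 | in_exon2

  # Criterion 2: both repeats are intronic, but in different introns
  different_introns <- !in_exon1 & !in_exon2 & loc1 != loc2

  distance <= max_dist & (any_exon | different_introns)
}

meets_edir_criteria("I2", "E3", 234)  # one repeat in an exon
meets_edir_criteria("I2", "I3", 500)  # introns on either side of exon 3
meets_edir_criteria("I2", "I2", 500)  # both repeats in the same intron
```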
Example data are provided with the package: a subset of the EDIR interspersed repeats data for the gene GAA (ENSG00000171298) on chromosome 17.
To query the full database, provide the data directory to gene_lookup() in the path parameter.
To install this package, enter the following in R:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("EDIRquery")
Then load the package:
library(EDIRquery)
EDIR is queried with the gene_lookup function, which takes the gene name and the following additional parameters:
Argument | Description | Default |
---|---|---|
gene | required: The gene name (Ensembl ID or HGNC symbol) | - |
length | Repeat sequence length, must be between 7 and 20. If NA, results will include all available lengths in dataset for queried gene | NA |
mindist | Minimum spacer distance (bp) between repeats | 0 |
maxdist | Maximum spacer distance (bp) between repeats | 1000 |
format | Output table format. One of ‘data.frame’, ‘GInteractions’. | ‘data.frame’ |
summary | Logical value indicating whether to store summary | FALSE |
mismatch | Logical value indicating whether to allow 1 mismatch in sequence | TRUE |
path | String containing path to directory holding downloaded dataset files. If not provided (path = NA), example subset of data will be used | NA |
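As a sketch of how these arguments combine, the call below restricts the query to 8 bp repeats with exact matches and a spacer of 100 to 500 bp. The argument names are those listed in the table; the values are arbitrary choices for illustration, and the call is guarded so that it only runs where EDIRquery is installed.

```r
# Illustrative query against the bundled example data (path left at its default).
# The specific values for length, mindist and maxdist are arbitrary.
if (requireNamespace("EDIRquery", quietly = TRUE)) {
  res <- EDIRquery::gene_lookup(
    "GAA",
    length   = 8,      # repeat length in bp (7 to 20)
    mindist  = 100,    # minimum spacer distance in bp
    maxdist  = 500,    # maximum spacer distance in bp
    mismatch = FALSE   # exact matches only
  )
}
```

To run the same query against the full database, the path argument would additionally point to the directory holding the downloaded dataset files.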
A summary of the query is printed to the console, including the gene name, gene length (bp), Ensembl transcript ID, queried distance between repeats (default: 0-1000 bp), and an overview of total results for the given repeat length. The console output also reports the runtime.
Example querying the gene “GAA” with repeats of length 7, and allowing for 1 mismatch:
# Summary of results (printed to console)
gene_lookup("GAA", length = 7, mismatch = TRUE)
#> Parameters
#> Repeat length: 7 bp
#>
#> Gene: ENSG00000171298 / GAA
#> Gene length: 18325 bp
#> Transcript ID: ENST00000302262
#> Distance: 0-1000 bp
#> Mismatch: TRUE
#>
#>
#> repeat_length unique_seqs tot_instances tot_structures avg_dist
#> 1 7 5172 10460 14562 486.2603
#> norm_instances_bp norm_instances_Mb norm_structures_bp norm_structures_Mb
#> 1 0.5708049 570804.9 0.7946521 794652.1
#>
#>
#> Runtime: 0.593 sec elapsed
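The normalized columns appear to be the raw counts divided by the gene length, with the per-Mb values scaled by 10^6. This relationship is inferred from the printed numbers rather than taken from the package source, so the following is a consistency check and not a description of the implementation:

```r
# Values copied from the console summary for GAA, repeat length 7
gene_length    <- 18325   # gene length in bp
tot_instances  <- 10460
tot_structures <- 14562

# Assumed normalization: counts per bp of gene, then scaled to counts per Mb
norm_instances_bp  <- tot_instances / gene_length
norm_structures_bp <- tot_structures / gene_length
norm_instances_Mb  <- norm_instances_bp * 1e6
norm_structures_Mb <- norm_structures_bp * 1e6

# Compare with the norm_* columns in the printed summary
signif(c(norm_instances_bp, norm_structures_bp), 7)
signif(c(norm_instances_Mb, norm_structures_Mb), 7)
```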
If no length is provided, a summary of all available repeat length results will be printed:
# Summary of results (printed to console)
gene_lookup("GAA", mismatch = TRUE)
#> Parameters
#>
#>
#> Gene: ENSG00000171298 / GAA
#> Gene length: 18325 bp
#> Transcript ID: ENST00000302262
#> Distance: 0-1000 bp
#> Mismatch: TRUE
#>
#>
#> repeat_length unique_seqs tot_instances tot_structures avg_dist
#> 1 7 5172 10460 14562 486.2603
#> 2 8 5677 7592 7062 516.1827
#> 3 9 3160 3461 2226 508.7588
#> 4 10 1172 1227 690 500.2217
#> 5 11 389 399 209 492.5263
#> 6 12 122 124 63 454.6190
#> 7 13 42 42 21 346.2857
#> 8 14 14 14 7 271.1429
#> 9 15 4 4 2 43.0000
#> 10 16 2 2 1 42.0000
#> norm_instances_bp norm_instances_Mb norm_structures_bp norm_structures_Mb
#> 1 0.5708049113 570804.9113 7.946521e-01 794652.11460
#> 2 0.4142974079 414297.4079 3.853752e-01 385375.17053
#> 3 0.1888676671 188867.6671 1.214734e-01 121473.39700
#> 4 0.0669577080 66957.7080 3.765348e-02 37653.47885
#> 5 0.0217735334 21773.5334 1.140518e-02 11405.18417
#> 6 0.0067667121 6766.7121 3.437926e-03 3437.92633
#> 7 0.0022919509 2291.9509 1.145975e-03 1145.97544
#> 8 0.0007639836 763.9836 3.819918e-04 381.99181
#> 9 0.0002182810 218.2810 1.091405e-04 109.14052
#> 10 0.0001091405 109.1405 5.457026e-05 54.57026
#>
#>
#> Runtime: 0.629 sec elapsed
Storing the output in a variable allows the individual results to be inspected in the returned data frame:
# Database output of query
results <- gene_lookup("GAA", length = 7, mismatch = TRUE)
#> Parameters
#> Repeat length: 7 bp
#>
#> Gene: ENSG00000171298 / GAA
#> Gene length: 18325 bp
#> Transcript ID: ENST00000302262
#> Distance: 0-1000 bp
#> Mismatch: TRUE
#>
#>
#> repeat_length unique_seqs tot_instances tot_structures avg_dist
#> 1 7 5172 10460 14562 486.2603
#> norm_instances_bp norm_instances_Mb norm_structures_bp norm_structures_Mb
#> 1 0.5708049 570804.9 0.7946521 794652.1
#>
#>
#> Runtime: 0.482 sec elapsed
head(results)
#> chromosome repeat_length start1 end1 start2 end2 repeat_seq1
#> 3930 17 7 80101595 80101601 80101734 80101740 CCGCGGG
#> 3931 17 7 80105602 80105608 80105843 80105849 CCGAGGC
#> 3932 17 7 80110005 80110011 80110061 80110067 CGGAGGG
#> 3933 17 7 80118254 80118260 80118270 80118276 CCAAGGG
#> 3934 17 7 80118270 80118276 80118318 80118324 CCGAGGG
#> 3935 17 7 80118270 80118276 80118533 80118539 CCGAGGG
#> intron_exon1 repeat_seq2 intron_exon2 distance ensembl_gene_id hgnc_symbol
#> 3930 E1 CCGCGGG E1 132 ENSG00000171298 GAA
#> 3931 I2 CCGAGGA E3 234 ENSG00000171298 GAA
#> 3932 E9 GCGAGGG I9 49 ENSG00000171298 GAA
#> 3933 E18 CCGAGGG E18 9 ENSG00000171298 GAA
#> 3934 E18 GCGAGGG E18 41 ENSG00000171298 GAA
#> 3935 E18 CAGAGGG I18 256 ENSG00000171298 GAA
#> gene_range ensembl_transcript_id transcript_range
#> 3930 80101556-80119881 ENST00000302262 80101581-80101890
#> 3931 80101556-80119881 ENST00000302262 80105133-80105748
#> 3932 80101556-80119881 ENST00000302262 80109945-80110055
#> 3933 80101556-80119881 ENST00000302262 80118193-80118357
#> 3934 80101556-80119881 ENST00000302262 80118193-80118357
#> 3935 80101556-80119881 ENST00000302262 80118193-80118357
#> feature mismatch
#> 3930 same exon 0
#> 3931 spanning intron-exon 1
#> 3932 spanning intron-exon 1
#> 3933 same exon 1
#> 3934 same exon 1
#> 3935 spanning intron-exon 1
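Because the result is an ordinary data frame, it can be filtered with base R. The sketch below re-types a few columns of the six rows shown above into a small data frame (named `results_head` here purely for illustration) so that it runs without the package. It selects exact-match repeat pairs within a single exon and checks that, for these rows, `distance` equals the number of bases between the end of the first repeat and the start of the second.

```r
# Six rows copied from head(results); only a subset of columns is kept
results_head <- data.frame(
  end1     = c(80101601, 80105608, 80110011, 80118260, 80118276, 80118276),
  start2   = c(80101734, 80105843, 80110061, 80118270, 80118318, 80118533),
  distance = c(132, 234, 49, 9, 41, 256),
  feature  = c("same exon", "spanning intron-exon", "spanning intron-exon",
               "same exon", "same exon", "spanning intron-exon"),
  mismatch = c(0, 1, 1, 1, 1, 1)
)

# Exact-match repeat pairs located within the same exon
exact_same_exon <- subset(results_head, feature == "same exon" & mismatch == 0)
exact_same_exon

# For these rows, distance is the spacer between the two repeats
spacer_matches <- with(results_head, all(distance == start2 - end1 - 1))
spacer_matches
```

The same `subset()` call can be applied to the full `results` object returned by gene_lookup().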
Changing the format parameter to GInteractions returns a GInteractions object (as defined in the InteractionSet package) instead of a data frame:
# Database output of query
results <- gene_lookup("GAA", length = 7, format = "GInteractions", mismatch = TRUE)
#> Parameters
#> Repeat length: 7 bp
#>
#> Gene: ENSG00000171298 / GAA
#> Gene length: 18325 bp
#> Transcript ID: ENST00000302262
#> Distance: 0-1000 bp
#> Mismatch: TRUE
#>
#>
#> repeat_length unique_seqs tot_instances tot_structures avg_dist
#> 1 7 5172 10460 14562 486.2603
#> norm_instances_bp norm_instances_Mb norm_structures_bp norm_structures_Mb
#> 1 0.5708049 570804.9 0.7946521 794652.1
#>
#>
#> Runtime: 0.592 sec elapsed
head(results)
#> GInteractions object with 6 interactions and 11 metadata columns:
#> seqnames1 ranges1 seqnames2 ranges2 |
#> <Rle> <IRanges> <Rle> <IRanges> |
#> [1] 17 80101595-80101601 --- 17 80101734-80101740 |
#> [2] 17 80105602-80105608 --- 17 80105843-80105849 |
#> [3] 17 80110005-80110011 --- 17 80110061-80110067 |
#> [4] 17 80118254-80118260 --- 17 80118270-80118276 |
#> [5] 17 80118270-80118276 --- 17 80118318-80118324 |
#> [6] 17 80118270-80118276 --- 17 80118533-80118539 |
#> anchor1.repeat_seq1 anchor1.intron_exon1 anchor2.repeat_seq2
#> <character> <character> <character>
#> [1] CCGCGGG E1 CCGCGGG
#> [2] CCGAGGC I2 CCGAGGA
#> [3] CGGAGGG E9 GCGAGGG
#> [4] CCAAGGG E18 CCGAGGG
#> [5] CCGAGGG E18 GCGAGGG
#> [6] CCGAGGG E18 CAGAGGG
#> anchor2.intron_exon2 ensembl_gene_id hgnc_symbol gene_range
#> <character> <character> <character> <character>
#> [1] E1 ENSG00000171298 GAA 80101556-80119881
#> [2] E3 ENSG00000171298 GAA 80101556-80119881
#> [3] I9 ENSG00000171298 GAA 80101556-80119881
#> [4] E18 ENSG00000171298 GAA 80101556-80119881
#> [5] E18 ENSG00000171298 GAA 80101556-80119881
#> [6] I18 ENSG00000171298 GAA 80101556-80119881
#> ensembl_transcript_id transcript_range feature mismatch
#> <character> <character> <character> <integer>
#> [1] ENST00000302262 80101581-80101890 same exon 0
#> [2] ENST00000302262 80105133-80105748 spanning intron-exon 1
#> [3] ENST00000302262 80109945-80110055 spanning intron-exon 1
#> [4] ENST00000302262 80118193-80118357 same exon 1
#> [5] ENST00000302262 80118193-80118357 same exon 1
#> [6] ENST00000302262 80118193-80118357 spanning intron-exon 1
#> -------
#> regions: 10460 ranges and 0 metadata columns
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
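A GInteractions object can be explored with the accessors from the InteractionSet and S4Vectors packages. The following is a minimal sketch, guarded so that it only runs where EDIRquery and InteractionSet are installed; the variable names are illustrative.

```r
if (requireNamespace("EDIRquery", quietly = TRUE) &&
    requireNamespace("InteractionSet", quietly = TRUE)) {
  gi <- EDIRquery::gene_lookup("GAA", length = 7,
                               format = "GInteractions", mismatch = TRUE)

  # Genomic ranges of the first and second repeat of each pair
  first_repeats  <- InteractionSet::anchors(gi, type = "first")
  second_repeats <- InteractionSet::anchors(gi, type = "second")

  # Gap between the two anchors of each pair, as computed by InteractionSet
  gaps <- InteractionSet::pairdist(gi, type = "gap")

  # Metadata columns (repeat sequences, feature, mismatch, ...)
  meta <- S4Vectors::mcols(gi)
}
```

Keeping the repeat pairs as paired genomic ranges makes it straightforward to combine them with other Bioconductor range-based tools, whereas the data frame format is more convenient for simple filtering.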
# Session information
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] EDIRquery_1.7.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3
#> [3] SparseArray_1.7.2 lattice_0.22-6
#> [5] hms_1.1.3 digest_0.6.37
#> [7] magrittr_2.0.3 evaluate_1.0.1
#> [9] grid_4.4.2 fastmap_1.2.0
#> [11] jsonlite_1.8.9 Matrix_1.7-1
#> [13] GenomeInfoDb_1.43.2 httr_1.4.7
#> [15] UCSC.utils_1.3.0 jquerylib_0.1.4
#> [17] abind_1.4-8 cli_3.6.3
#> [19] rlang_1.1.4 crayon_1.5.3
#> [21] XVector_0.47.1 Biobase_2.67.0
#> [23] bit64_4.5.2 cachem_1.1.0
#> [25] DelayedArray_0.33.3 yaml_2.3.10
#> [27] S4Arrays_1.7.1 tools_4.4.2
#> [29] tzdb_0.4.0 GenomeInfoDbData_1.2.13
#> [31] InteractionSet_1.35.0 SummarizedExperiment_1.37.0
#> [33] BiocGenerics_0.53.3 buildtools_1.0.0
#> [35] vctrs_0.6.5 R6_2.5.1
#> [37] matrixStats_1.4.1 stats4_4.4.2
#> [39] lifecycle_1.0.4 bit_4.5.0.1
#> [41] tictoc_1.2.1 S4Vectors_0.45.2
#> [43] IRanges_2.41.2 vroom_1.6.5
#> [45] pkgconfig_2.0.3 bslib_0.8.0
#> [47] pillar_1.10.0 glue_1.8.0
#> [49] Rcpp_1.0.13-1 tidyselect_1.2.1
#> [51] xfun_0.49 tibble_3.2.1
#> [53] GenomicRanges_1.59.1 sys_3.4.3
#> [55] MatrixGenerics_1.19.0 knitr_1.49
#> [57] htmltools_0.5.8.1 maketools_1.3.1
#> [59] readr_2.1.5 compiler_4.4.2