library(comapr)
library(GenomicRanges)
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
library(BiocParallel)
Install via BiocManager as follow:
comapr
lets you interrogate the genotyping results for a
sequence of markers across the chromosome and detect meiotic crossovers
from genotype shifts. In this document, we demonstrate how genetic
distances are calculated from genotyping results for a group of samples
using functions available from comapr
.
The package includes a small data object that contains 70 markers and their genotyping results for 22 samples. The 22 samples are the BC1F1 progenies generated by
Therefore, the genotype shifts detected in BC1F1 samples represent crossovers that have happened during meiosis for the F1 parents.
BiocParallel::register(SerialParam())
BiocParallel::bpparam()
#> class: SerialParam
#> bpisup: FALSE; bpnworkers: 1; bptasks: 0; bpjobname: BPJOB
#> bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
#> bpRNGseed: ; bptimeout: NA; bpprogressbar: FALSE
#> bpexportglobals: FALSE; bpexportvariables: FALSE; bpforceGC: FALSE
#> bpfallback: FALSE
#> bplogdir: NA
#> bpresultdir: NA
In order to detect crossovers from the 22 samples’ genotype results
(BC1F1 samples), we need to format the result into
Homo_ref
, Homo_alt
and Het
encodings. This can be simply done through comparing with the parents’
genotypes.
Here, the genotype noted as “Fail” will be converted to NA. Please also note that we supplied “Homo_ref” as one of the fail options because “Homo_ref” is not in the possible genotypes of markers in BC1F1 samples.
corrected_geno <- correctGT(gt_matrix = mcols(snp_geno_gr),
ref = parents_geno$ref,
alt = parents_geno$alt,
fail = "Fail",
wrong_label = "Homo_ref")
mcols(snp_geno_gr) <- corrected_geno
The corrected_geno
matrix
head(mcols(snp_geno_gr)[,1:5])
#> DataFrame with 6 rows and 5 columns
#> X92 X93 X94 X95 X96
#> <character> <character> <character> <character> <character>
#> 1 Homo_alt Homo_alt Homo_alt Homo_alt Homo_alt
#> 2 Homo_alt Homo_alt Homo_alt Homo_alt Homo_alt
#> 3 Homo_alt Homo_alt Homo_alt Homo_alt Homo_alt
#> 4 Homo_alt Homo_alt Homo_alt Homo_alt Homo_alt
#> 5 NA Homo_alt Homo_alt Homo_alt Homo_alt
#> 6 Het Homo_alt Homo_alt Homo_alt Homo_alt
Note that there are missing values in this resulting matrix that can be resulted from:
NA
In this step, we try to identify markers that have NA
genotype across many samples or samples that have a lot markers failed
for removal. We use the countGT
function for find bad
markers/samples.
The number of markes and samples are saved in a list returned by
countGT
genotype_counts$n_markers
#> [1] 30 33 33 33 33 33 31 31 32 33 33 33 32 33 33 33 32 32 33 32 33 33
genotype_counts$n_samples
#> [1] 22 22 22 22 18 22 22 22 22 22 22 21 22 22 22 22 22 22 22 22 22 22 22 22 22
#> [26] 22 22 22 16 22 22 21 22
We now filter out markers/samples by using function
filterGT
. min_markers
specifies at least how
many markers a sample needs to be kept. Likewise for
min_samples
.
A printed message contains information about how much markers or samples have been filtered.
Sample duplicates are identified by finding samples that share
exactly same genotypes across all available markers.
findDupSamples
can be applied and a threshold value is
provided and used as a cut-off on the percentage of same genotype
markers the duplicated samples should share.
dups <- findDupSamples(mcols(corrected_geno),
threshold = 0.99)
dups
#> X98
#> [1,] "X98"
#> [2,] "X99"
Now we remove the duplicated samples.
Crossovers are detected and counted through examining the patterns of
genotypes along the chromosome. When there is a shift from one genotype
block to another, a crossover is observed. This is done through calling
countCOs
function which returns a GRange
object with crossover counts for the list of marker intervals.
The crossover count values in the columns can be non-integer when one observed crossover can not be determined to be completely distributed to the marker interval in the corresponding row. The observed crossover is then distributed to the adjacent intervals proportionally to their interval base pair sizes.
marker_gr_cos <- countCOs(corrected_geno)
marker_gr_cos[1:5,1:5]
#> GRanges object with 5 ranges and 5 metadata columns:
#> seqnames ranges strand | X92 X93 X94
#> <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
#> [1] 1 6655965-21638463 * | 0.9358742 0 0
#> [2] 1 21638465-22665059 * | 0.0641258 0 0
#> [3] 1 22665061-34590735 * | 0.0000000 0 0
#> [4] 1 35033642-38996025 * | 0.0000000 0 0
#> [5] 2 4248665-5348752 * | 0.0000000 0 0
#> X95 X96
#> <numeric> <numeric>
#> [1] 0 0
#> [2] 0 0
#> [3] 0 0
#> [4] 0 0
#> [5] 0 0
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
The genetic distances of marker intervals are calcuated based on the
crossover rates via applying mapping function, Kosambi
or
Haldane
by calling the calGeneticDist
function. The returned genetic distances are in unit of centiMorgan.
dist_gr <- calGeneticDist(marker_gr_cos,
mapping_fun = "k")
dist_gr[1:5,]
#> GRanges object with 5 ranges and 1 metadata column:
#> seqnames ranges strand | kosambi_cm
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 6655965-21638463 * | 14.36279
#> [2] 1 21638465-22665059 * | 5.08472
#> [3] 1 22665061-34590735 * | 9.64156
#> [4] 1 35033642-38996025 * | 9.64156
#> [5] 2 4248665-5348752 * | 4.77638
#> -------
#> seqinfo: 4 sequences from an unspecified genome; no seqlengths
Alternatively, instead of returning the genetic distances in supplied
marker intervals, we can specify a bin_size
which tells
calGeneticDist
to return calculated genetic distances for
equally sized chromosome bins.
dist_bin_gr <- calGeneticDist(marker_gr_cos,bin_size = 1e6)
dist_bin_gr[1:5,]
#> GRanges object with 5 ranges and 1 metadata column:
#> seqnames ranges strand | kosambi_cm
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 1-998753 * | 0
#> [2] 1 998754-1997506 * | 0
#> [3] 1 1997507-2996258 * | 0
#> [4] 1 2996259-3995011 * | 0
#> [5] 1 3995012-4993763 * | 0
#> -------
#> seqinfo: 4 sequences from mm10 genome
With genetic distances calculated, we can do a sum of all genetic distances across all marker intervals. We can see that we got the same total genetic distances for marker based intervals and the equally binned intervals.
comapr
also includes functions for visulising genetic
distances of marker intervals or binned intervals.
We can also plot the cumulative genetic distances of certain chromosomes
Multiple chromosomes
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocParallel_1.41.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [4] IRanges_2.41.2 S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [7] generics_0.1.3 comapr_1.11.1 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 shape_1.4.6.1
#> [3] sys_3.4.3 rstudioapi_0.17.1
#> [5] jsonlite_1.8.9 magrittr_2.0.3
#> [7] GenomicFeatures_1.59.1 farver_2.1.2
#> [9] rmarkdown_2.29 GlobalOptions_0.1.2
#> [11] BiocIO_1.17.1 vctrs_0.6.5
#> [13] memoise_2.0.1 Rsamtools_2.23.1
#> [15] RCurl_1.98-1.16 base64enc_0.1-3
#> [17] htmltools_0.5.8.1 S4Arrays_1.7.1
#> [19] progress_1.2.3 curl_6.1.0
#> [21] SparseArray_1.7.4 Formula_1.2-5
#> [23] sass_0.4.9 bslib_0.8.0
#> [25] htmlwidgets_1.6.4 plyr_1.8.9
#> [27] Gviz_1.51.0 httr2_1.1.0
#> [29] plotly_4.10.4 cachem_1.1.0
#> [31] buildtools_1.0.0 GenomicAlignments_1.43.0
#> [33] iterators_1.0.14 lifecycle_1.0.4
#> [35] pkgconfig_2.0.3 Matrix_1.7-1
#> [37] R6_2.5.1 fastmap_1.2.0
#> [39] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.1
#> [41] digest_0.6.37 colorspace_2.1-1
#> [43] AnnotationDbi_1.69.0 Hmisc_5.2-2
#> [45] RSQLite_2.3.9 labeling_0.4.3
#> [47] filelock_1.0.3 httr_1.4.7
#> [49] abind_1.4-8 compiler_4.4.2
#> [51] withr_3.0.2 bit64_4.6.0-1
#> [53] htmlTable_2.4.3 backports_1.5.0
#> [55] DBI_1.2.3 biomaRt_2.63.0
#> [57] rappdirs_0.3.3 DelayedArray_0.33.4
#> [59] rjson_0.2.23 tools_4.4.2
#> [61] foreign_0.8-88 nnet_7.3-20
#> [63] glue_1.8.0 restfulr_0.0.15
#> [65] grid_4.4.2 checkmate_2.3.2
#> [67] reshape2_1.4.4 cluster_2.1.8
#> [69] gtable_0.3.6 BSgenome_1.75.0
#> [71] tidyr_1.3.1 ensembldb_2.31.0
#> [73] data.table_1.16.4 hms_1.1.3
#> [75] xml2_1.3.6 XVector_0.47.2
#> [77] foreach_1.5.2 pillar_1.10.1
#> [79] stringr_1.5.1 circlize_0.4.16
#> [81] dplyr_1.1.4 BiocFileCache_2.15.1
#> [83] lattice_0.22-6 rtracklayer_1.67.0
#> [85] bit_4.5.0.1 deldir_2.0-4
#> [87] biovizBase_1.55.0 tidyselect_1.2.1
#> [89] maketools_1.3.1 Biostrings_2.75.3
#> [91] knitr_1.49 gridExtra_2.3
#> [93] ProtGenerics_1.39.2 SummarizedExperiment_1.37.0
#> [95] xfun_0.50 Biobase_2.67.0
#> [97] matrixStats_1.5.0 stringi_1.8.4
#> [99] UCSC.utils_1.3.1 lazyeval_0.2.2
#> [101] yaml_2.3.10 evaluate_1.0.3
#> [103] codetools_0.2-20 interp_1.1-6
#> [105] tibble_3.2.1 BiocManager_1.30.25
#> [107] cli_3.6.3 rpart_4.1.24
#> [109] munsell_0.5.1 jquerylib_0.1.4
#> [111] dichromat_2.0-0.1 Rcpp_1.0.14
#> [113] dbplyr_2.5.0 png_0.1-8
#> [115] XML_3.99-0.18 parallel_4.4.2
#> [117] ggplot2_3.5.1 blob_1.2.4
#> [119] prettyunits_1.2.0 latticeExtra_0.6-30
#> [121] jpeg_0.1-10 AnnotationFilter_1.31.0
#> [123] bitops_1.0-9 viridisLite_0.4.2
#> [125] VariantAnnotation_1.53.1 scales_1.3.0
#> [127] purrr_1.0.2 crayon_1.5.3
#> [129] rlang_1.1.5 KEGGREST_1.47.0