In the previous vignette, we demonstrated how to calculate genetic distances using genotyped markers from a group of samples.
Genetic distance calculate from genotype shifts of markers
In this document, we will focus on building individualized genetic
maps from output files of sscocaller
which is available at
here.
The comapr
package includes a list of toy example output
files from the sscocaller
command line tool. The follow
code will get the file path that we will use later.
We can see that we have two samples (individual donors) and each of them has haplotype states inferred for chr1 to chr5.
list.files(demo_path)
#> [1] "s1_barcodes.txt" "s1_chr1_altCount.mtx" "s1_chr1_snpAnnot.txt"
#> [4] "s1_chr1_totalCount.mtx" "s1_chr1_vi.mtx" "s1_chr1_viSegInfo.txt"
#> [7] "s1_chr2_altCount.mtx" "s1_chr2_snpAnnot.txt" "s1_chr2_totalCount.mtx"
#> [10] "s1_chr2_vi.mtx" "s1_chr2_viSegInfo.txt" "s1_chr3_altCount.mtx"
#> [13] "s1_chr3_snpAnnot.txt" "s1_chr3_totalCount.mtx" "s1_chr3_vi.mtx"
#> [16] "s1_chr3_viSegInfo.txt" "s1_chr4_altCount.mtx" "s1_chr4_snpAnnot.txt"
#> [19] "s1_chr4_totalCount.mtx" "s1_chr4_vi.mtx" "s1_chr4_viSegInfo.txt"
#> [22] "s1_chr5_altCount.mtx" "s1_chr5_snpAnnot.txt" "s1_chr5_totalCount.mtx"
#> [25] "s1_chr5_vi.mtx" "s1_chr5_viSegInfo.txt" "s2_barcodes.txt"
#> [28] "s2_chr1_snpAnnot.txt" "s2_chr1_vi.mtx" "s2_chr1_viSegInfo.txt"
#> [31] "s2_chr2_snpAnnot.txt" "s2_chr2_vi.mtx" "s2_chr2_viSegInfo.txt"
#> [34] "s2_chr3_snpAnnot.txt" "s2_chr3_vi.mtx" "s2_chr3_viSegInfo.txt"
#> [37] "s2_chr4_snpAnnot.txt" "s2_chr4_vi.mtx" "s2_chr4_viSegInfo.txt"
#> [40] "s2_chr5_snpAnnot.txt" "s2_chr5_vi.mtx" "s2_chr5_viSegInfo.txt"
comapr
provides quality-check functions for examining
SNP coverage per chr and per cell, chromosome segregation pattern
checks, and summary statics for filtering low confidence crossovers.
Input-parsing functions are implemented in comapr
to
construct RangedSummarizedExpriment
object that parses
files generated from sscocaller
and filter out
low-confidence COs at the same time. For the demo dataset, these filters
do not have any effects:
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1
RangedSummarizedExpriment
objectWe first construct RangedSummarizedExpriment
object from
parsing output files from sscocaller
and filter out
low-confidence crossovers.
s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"),
path=demo_path,barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"),
path=demo_path,
barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s1_rse_state
#> class: RangedSummarizedExperiment
#> dim: 12 5
#> metadata(10): ithSperm Seg_start ... bp_dist barcode
#> assays(1): vi_state
#> rownames: NULL
#> rowData names(0):
#> colnames(5): BC1 BC2 BC3 BC4 BC5
#> colData names(1): barcodes
The Viterbi states for SNP markers are stored in the “assay” slot:
assay(s1_rse_state)
#> 12 x 5 sparse Matrix of class "dgCMatrix"
#> BC1 BC2 BC3 BC4 BC5
#> [1,] 1 2 1 2 1
#> [2,] 1 . . . .
#> [3,] . 2 . . .
#> [4,] . . . . 1
#> [5,] 2 1 . . .
#> [6,] 2 1 1 2 .
#> [7,] 1 2 1 2 1
#> [8,] 1 . . . .
#> [9,] . 2 . . .
#> [10,] . . . . 1
#> [11,] 2 1 . . .
#> [12,] 2 1 1 2 .
The rowRanges
contains the SNP’s positions:
rowRanges(s1_rse_state)
#> GRanges object with 12 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3000 *
#> [2] chr1 3200 *
#> [3] chr1 4000 *
#> [4] chr1 4500 *
#> [5] chr1 5500 *
#> ... ... ... ...
#> [8] chr2 3200 *
#> [9] chr2 4000 *
#> [10] chr2 4500 *
#> [11] chr2 5500 *
#> [12] chr2 6000 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
We have read in the Viterbi states for cells from two samples: s1 and s2. We now combine them into one data object for downstream analysis.
We now call combineHapState
function to combine sample
s1 and sample s2 into one RangedSummarizedExperiment
object.
Now the assay
data slot contains the Viterbi states
across SNP positions for the combined samples.
Now the twoSample
object contains the cells from both
samples with assay
slot contains the Viterbi states and
rowRanges
contains position annotaitons for the list SNP
markers.
assay(twoSample)
#> 12 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'BC1', 'BC2', 'BC3' ... ]]
#>
#> [1,] 1 2 1 2 1 1 2 1 2 1
#> [2,] 1 . . . . . . . . .
#> [3,] . 2 . . . . 2 . . .
#> [4,] . . . . 1 . . . . 1
#> [5,] 2 1 . . . . 1 . . .
#> [6,] 2 1 1 2 . 1 1 1 2 .
#> [7,] 1 2 1 2 1 1 2 1 2 1
#> [8,] 1 . . . . . . . . .
#> [9,] . 2 . . . . 2 . . .
#> [10,] . . . . 1 . . . . 1
#> [11,] 2 1 . . . . 1 . . .
#> [12,] 2 1 1 2 . 1 1 1 2 .
The countCOs
function can then find out the crossover
intervals according to the Viterbi states for each cell and the number
of crossovers per cell is then calculated.
countCOs
function will find the crossover intervals for
each samples and attribute the observed crossovers from each sample to
the corresponding intervals.
The rowRanges
from the resulting data object now denotes
the crossover interval and the assay
slot now contains the
number of crossovers in each cell.
Now rowRanges
contains the intervals
rowRanges(cocounts)
#> GRanges object with 6 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3201-3999 *
#> [2] chr1 4001-4499 *
#> [3] chr1 4501-5499 *
#> [4] chr2 3201-3999 *
#> [5] chr2 4001-4499 *
#> [6] chr2 4501-5499 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
The colData slot still contains the annotations of each cell.
The genetic distances can be calculated by using mapping functions
such as the Kosambi or Haldane and assay
slot contains the
number of crossovers found in each sample across these intervals.
assay(cocounts)
#> DataFrame with 6 rows and 10 columns
#> BC1 BC2 BC3 BC4 BC5 BCa BCb
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 0.347826 0.000000 0 0 0 0 0.000000
#> 2 0.217391 0.333333 0 0 0 0 0.333333
#> 3 0.434783 0.666667 0 0 0 0 0.666667
#> 4 0.347826 0.000000 0 0 0 0 0.000000
#> 5 0.217391 0.333333 0 0 0 0 0.333333
#> 6 0.434783 0.666667 0 0 0 0 0.666667
#> BCc BCd BCe
#> <numeric> <numeric> <numeric>
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
calGeneticDist
function will convert the raw crossover
frequencies to genetic distances via selected mapping function (ie.
kosambi or haldane).
dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k")
dist_gr
#> class: RangedSummarizedExperiment
#> dim: 6 10
#> metadata(0):
#> assays(1): co_count
#> rownames: NULL
#> rowData names(2): raw_rate kosambi
#> colnames(10): BC1 BC2 ... BCd BCe
#> colData names(2): barcodes sampleGroup
The genetic distances for each interval are stored in
rowData
.
rowData(dist_gr)
#> DataFrame with 6 rows and 2 columns
#> raw_rate kosambi
#> <numeric> <numeric>
#> 1 0.0347826 3.48389
#> 2 0.0884058 8.93447
#> 3 0.1768116 18.47894
#> 4 0.0347826 3.48389
#> 5 0.0884058 8.93447
#> 6 0.1768116 18.47894
The above genetic distances have been calculated using all samples. We can also specify the group factor so that we can calculate genetic distances for different sample groups:
## sampleGroup is a column in the colData slot
dist_gr <- calGeneticDist(co_count = cocounts,
group_by = "sampleGroup",
mapping_fun = "k")
Now the group/Sample specific distances are stored in
rowData
We construct a GRange
object from the
dist_gr
first.
We can plot whole-genome genetic distances
We can also do per chromosome plot
bootstrapDist
function generates the sampling
distribution of the difference in total genetic distances for the two
groups under comparisons.
From the bootsDiff
data object, we can find a 95%
confidence interval to test whether the difference is statistically
significant.
An alternative re-sampling method, permuteDist
, can be
applied to generate a null distribution for the group difference by
reshuffling the group labels across the samples.
A p-value can be calculated using the statmod::permp
function (Phipson and Smyth 2010).
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
#> [3] MatrixGenerics_1.19.1 matrixStats_1.5.0
#> [5] BiocParallel_1.41.0 GenomicRanges_1.59.1
#> [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [9] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [11] generics_0.1.3 comapr_1.11.1
#> [13] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 shape_1.4.6.1 sys_3.4.3
#> [4] rstudioapi_0.17.1 jsonlite_1.8.9 magrittr_2.0.3
#> [7] GenomicFeatures_1.59.1 farver_2.1.2 rmarkdown_2.29
#> [10] GlobalOptions_0.1.2 BiocIO_1.17.1 vctrs_0.6.5
#> [13] memoise_2.0.1 Rsamtools_2.23.1 RCurl_1.98-1.16
#> [16] base64enc_0.1-3 htmltools_0.5.8.1 S4Arrays_1.7.1
#> [19] progress_1.2.3 curl_6.1.0 SparseArray_1.7.4
#> [22] Formula_1.2-5 sass_0.4.9 bslib_0.8.0
#> [25] htmlwidgets_1.6.4 plyr_1.8.9 Gviz_1.51.0
#> [28] httr2_1.1.0 plotly_4.10.4 cachem_1.1.0
#> [31] buildtools_1.0.0 GenomicAlignments_1.43.0 iterators_1.0.14
#> [34] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-1
#> [37] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [40] digest_0.6.37 colorspace_2.1-1 AnnotationDbi_1.69.0
#> [43] Hmisc_5.2-2 RSQLite_2.3.9 labeling_0.4.3
#> [46] filelock_1.0.3 httr_1.4.7 abind_1.4-8
#> [49] compiler_4.4.2 withr_3.0.2 bit64_4.6.0-1
#> [52] htmlTable_2.4.3 backports_1.5.0 DBI_1.2.3
#> [55] biomaRt_2.63.0 rappdirs_0.3.3 DelayedArray_0.33.4
#> [58] rjson_0.2.23 tools_4.4.2 foreign_0.8-88
#> [61] nnet_7.3-20 glue_1.8.0 restfulr_0.0.15
#> [64] grid_4.4.2 checkmate_2.3.2 reshape2_1.4.4
#> [67] cluster_2.1.8 gtable_0.3.6 BSgenome_1.75.0
#> [70] tidyr_1.3.1 ensembldb_2.31.0 data.table_1.16.4
#> [73] hms_1.1.3 xml2_1.3.6 XVector_0.47.2
#> [76] foreach_1.5.2 pillar_1.10.1 stringr_1.5.1
#> [79] circlize_0.4.16 dplyr_1.1.4 BiocFileCache_2.15.1
#> [82] lattice_0.22-6 rtracklayer_1.67.0 bit_4.5.0.1
#> [85] deldir_2.0-4 biovizBase_1.55.0 tidyselect_1.2.1
#> [88] maketools_1.3.1 Biostrings_2.75.3 knitr_1.49
#> [91] gridExtra_2.3 ProtGenerics_1.39.2 xfun_0.50
#> [94] statmod_1.5.0 stringi_1.8.4 UCSC.utils_1.3.1
#> [97] lazyeval_0.2.2 yaml_2.3.10 evaluate_1.0.3
#> [100] codetools_0.2-20 interp_1.1-6 tibble_3.2.1
#> [103] BiocManager_1.30.25 cli_3.6.3 rpart_4.1.24
#> [106] munsell_0.5.1 jquerylib_0.1.4 dichromat_2.0-0.1
#> [109] Rcpp_1.0.14 dbplyr_2.5.0 png_0.1-8
#> [112] XML_3.99-0.18 parallel_4.4.2 ggplot2_3.5.1
#> [115] blob_1.2.4 prettyunits_1.2.0 latticeExtra_0.6-30
#> [118] jpeg_0.1-10 AnnotationFilter_1.31.0 bitops_1.0-9
#> [121] viridisLite_0.4.2 VariantAnnotation_1.53.1 scales_1.3.0
#> [124] purrr_1.0.2 crayon_1.5.3 rlang_1.1.5
#> [127] KEGGREST_1.47.0