single-sperm-co-analysis

suppressPackageStartupMessages({
  library(comapr)
  library(SummarizedExperiment)
  })

Introduction

In the previous vignette, we demonstrated how to calculate genetic distances using genotyped markers from a group of samples.

Genetic distance calculate from genotype shifts of markers

In this document, we will focus on building individualized genetic maps from output files of sscocaller which is available at here.

Locate file path

The comapr package includes a list of toy example output files from the sscocaller command line tool. The follow code will get the file path that we will use later.

demo_path <-paste0(system.file("extdata",package = "comapr"),"/")

We can see that we have two samples (individual donors) and each of them has haplotype states inferred for chr1 to chr5.

list.files(demo_path)
#>  [1] "s1_barcodes.txt"        "s1_chr1_altCount.mtx"   "s1_chr1_snpAnnot.txt"  
#>  [4] "s1_chr1_totalCount.mtx" "s1_chr1_vi.mtx"         "s1_chr1_viSegInfo.txt" 
#>  [7] "s1_chr2_altCount.mtx"   "s1_chr2_snpAnnot.txt"   "s1_chr2_totalCount.mtx"
#> [10] "s1_chr2_vi.mtx"         "s1_chr2_viSegInfo.txt"  "s1_chr3_altCount.mtx"  
#> [13] "s1_chr3_snpAnnot.txt"   "s1_chr3_totalCount.mtx" "s1_chr3_vi.mtx"        
#> [16] "s1_chr3_viSegInfo.txt"  "s1_chr4_altCount.mtx"   "s1_chr4_snpAnnot.txt"  
#> [19] "s1_chr4_totalCount.mtx" "s1_chr4_vi.mtx"         "s1_chr4_viSegInfo.txt" 
#> [22] "s1_chr5_altCount.mtx"   "s1_chr5_snpAnnot.txt"   "s1_chr5_totalCount.mtx"
#> [25] "s1_chr5_vi.mtx"         "s1_chr5_viSegInfo.txt"  "s2_barcodes.txt"       
#> [28] "s2_chr1_snpAnnot.txt"   "s2_chr1_vi.mtx"         "s2_chr1_viSegInfo.txt" 
#> [31] "s2_chr2_snpAnnot.txt"   "s2_chr2_vi.mtx"         "s2_chr2_viSegInfo.txt" 
#> [34] "s2_chr3_snpAnnot.txt"   "s2_chr3_vi.mtx"         "s2_chr3_viSegInfo.txt" 
#> [37] "s2_chr4_snpAnnot.txt"   "s2_chr4_vi.mtx"         "s2_chr4_viSegInfo.txt" 
#> [40] "s2_chr5_snpAnnot.txt"   "s2_chr5_vi.mtx"         "s2_chr5_viSegInfo.txt"

File information

  • *.mtx
    • sparse matrix with columns corresponding to the list of sperm cell barcodes and rows corresponding to the list of SNP positions in VCF file
    • {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts
    • {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts
    • {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state)
  • {sample}_chr1_snpAnnot.txt, the SNP positions and allele
  • {sample}_chr1_SegInfo.txt, statistics of viterbi state segments in text file format. It contains consecutive viterbi states for each chromosome with statistics including, starting SNP position, ending SNP position, the number of SNPs supporting the segment, the log likelihood ratio of the viterbi segment and the inferred hidden state.

Diagnostic functions

comapr provides quality-check functions for examining SNP coverage per chr and per cell, chromosome segregation pattern checks, and summary statics for filtering low confidence crossovers.

perCellChrQC

pcQC <- perCellChrQC(sampleName="s1",chroms=c("chr1"),
                     path=paste0(demo_path,"/"),
                     barcodeFile=NULL)

Input parsing

Input-parsing functions are implemented in comapr to construct RangedSummarizedExpriment object that parses files generated from sscocaller and filter out low-confidence COs at the same time. For the demo dataset, these filters do not have any effects:

minSNP = 0, 
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1

Construct RangedSummarizedExpriment object

We first construct RangedSummarizedExpriment object from parsing output files from sscocaller and filter out low-confidence crossovers.

s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"),
                             path=demo_path,barcodeFile=NULL,
                             minSNP = 0,
                             minlogllRatio = 50,
                             bpDist = 100,
                             maxRawCO=10,
                             minCellSNP = 1)

s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"),
                             path=demo_path,
                             barcodeFile=NULL,
                             minSNP = 0,
                             minlogllRatio = 50,
                             bpDist = 100,
                             maxRawCO=10,
                             minCellSNP = 1)
s1_rse_state
#> class: RangedSummarizedExperiment 
#> dim: 12 5 
#> metadata(10): ithSperm Seg_start ... bp_dist barcode
#> assays(1): vi_state
#> rownames: NULL
#> rowData names(0):
#> colnames(5): BC1 BC2 BC3 BC4 BC5
#> colData names(1): barcodes

The Viterbi states for SNP markers are stored in the “assay” slot:

assay(s1_rse_state)
#> 12 x 5 sparse Matrix of class "dgCMatrix"
#>       BC1 BC2 BC3 BC4 BC5
#>  [1,]   1   2   1   2   1
#>  [2,]   1   .   .   .   .
#>  [3,]   .   2   .   .   .
#>  [4,]   .   .   .   .   1
#>  [5,]   2   1   .   .   .
#>  [6,]   2   1   1   2   .
#>  [7,]   1   2   1   2   1
#>  [8,]   1   .   .   .   .
#>  [9,]   .   2   .   .   .
#> [10,]   .   .   .   .   1
#> [11,]   2   1   .   .   .
#> [12,]   2   1   1   2   .

The rowRanges contains the SNP’s positions:

rowRanges(s1_rse_state)
#> GRanges object with 12 ranges and 0 metadata columns:
#>        seqnames    ranges strand
#>           <Rle> <IRanges>  <Rle>
#>    [1]     chr1      3000      *
#>    [2]     chr1      3200      *
#>    [3]     chr1      4000      *
#>    [4]     chr1      4500      *
#>    [5]     chr1      5500      *
#>    ...      ...       ...    ...
#>    [8]     chr2      3200      *
#>    [9]     chr2      4000      *
#>   [10]     chr2      4500      *
#>   [11]     chr2      5500      *
#>   [12]     chr2      6000      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Formate sample group factor

We have read in the Viterbi states for cells from two samples: s1 and s2. We now combine them into one data object for downstream analysis.

Add sample group factor

colData(s1_rse_state)
#> DataFrame with 5 rows and 1 column
#>        barcodes
#>     <character>
#> BC1         BC1
#> BC2         BC2
#> BC3         BC3
#> BC4         BC4
#> BC5         BC5

colData(s1_rse_state)$sampleGroup <- "s1"

colData(s2_rse_state)$sampleGroup <- "s2"

Combine two groups

We now call combineHapState function to combine sample s1 and sample s2 into one RangedSummarizedExperiment object.

twoSample <- combineHapState(s1_rse_state,
                             s2_rse_state)

Now the assay data slot contains the Viterbi states across SNP positions for the combined samples.

twoSample <- combineHapState(s1_rse_state,s2_rse_state)

Now the twoSample object contains the cells from both samples with assay slot contains the Viterbi states and rowRanges contains position annotaitons for the list SNP markers.


assay(twoSample)
#> 12 x 10 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 10 column names 'BC1', 'BC2', 'BC3' ... ]]
#>                          
#>  [1,] 1 2 1 2 1 1 2 1 2 1
#>  [2,] 1 . . . . . . . . .
#>  [3,] . 2 . . . . 2 . . .
#>  [4,] . . . . 1 . . . . 1
#>  [5,] 2 1 . . . . 1 . . .
#>  [6,] 2 1 1 2 . 1 1 1 2 .
#>  [7,] 1 2 1 2 1 1 2 1 2 1
#>  [8,] 1 . . . . . . . . .
#>  [9,] . 2 . . . . 2 . . .
#> [10,] . . . . 1 . . . . 1
#> [11,] 2 1 . . . . 1 . . .
#> [12,] 2 1 1 2 . 1 1 1 2 .

Count crossovers

The countCOs function can then find out the crossover intervals according to the Viterbi states for each cell and the number of crossovers per cell is then calculated.

Count crossovers for SNP intervals

countCOs function will find the crossover intervals for each samples and attribute the observed crossovers from each sample to the corresponding intervals.

cocounts <- countCOs(twoSample)

The rowRanges from the resulting data object now denotes the crossover interval and the assay slot now contains the number of crossovers in each cell.

Now rowRanges contains the intervals

rowRanges(cocounts)
#> GRanges object with 6 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1 3201-3999      *
#>   [2]     chr1 4001-4499      *
#>   [3]     chr1 4501-5499      *
#>   [4]     chr2 3201-3999      *
#>   [5]     chr2 4001-4499      *
#>   [6]     chr2 4501-5499      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The colData slot still contains the annotations of each cell.

colData(cocounts)
#> DataFrame with 10 rows and 2 columns
#>        barcodes sampleGroup
#>     <character> <character>
#> BC1         BC1          s1
#> BC2         BC2          s1
#> BC3         BC3          s1
#> BC4         BC4          s1
#> BC5         BC5          s1
#> BCa         BCa          s2
#> BCb         BCb          s2
#> BCc         BCc          s2
#> BCd         BCd          s2
#> BCe         BCe          s2

Calculate genetic distances

The genetic distances can be calculated by using mapping functions such as the Kosambi or Haldane and assay slot contains the number of crossovers found in each sample across these intervals.

assay(cocounts)
#> DataFrame with 6 rows and 10 columns
#>         BC1       BC2       BC3       BC4       BC5       BCa       BCb
#>   <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1  0.347826  0.000000         0         0         0         0  0.000000
#> 2  0.217391  0.333333         0         0         0         0  0.333333
#> 3  0.434783  0.666667         0         0         0         0  0.666667
#> 4  0.347826  0.000000         0         0         0         0  0.000000
#> 5  0.217391  0.333333         0         0         0         0  0.333333
#> 6  0.434783  0.666667         0         0         0         0  0.666667
#>         BCc       BCd       BCe
#>   <numeric> <numeric> <numeric>
#> 1         0         0         0
#> 2         0         0         0
#> 3         0         0         0
#> 4         0         0         0
#> 5         0         0         0
#> 6         0         0         0

Calculate genetic distances

calGeneticDist function will convert the raw crossover frequencies to genetic distances via selected mapping function (ie. kosambi or haldane).

dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k")
dist_gr
#> class: RangedSummarizedExperiment 
#> dim: 6 10 
#> metadata(0):
#> assays(1): co_count
#> rownames: NULL
#> rowData names(2): raw_rate kosambi
#> colnames(10): BC1 BC2 ... BCd BCe
#> colData names(2): barcodes sampleGroup

The genetic distances for each interval are stored in rowData.

rowData(dist_gr)
#> DataFrame with 6 rows and 2 columns
#>    raw_rate   kosambi
#>   <numeric> <numeric>
#> 1 0.0347826   3.48389
#> 2 0.0884058   8.93447
#> 3 0.1768116  18.47894
#> 4 0.0347826   3.48389
#> 5 0.0884058   8.93447
#> 6 0.1768116  18.47894

The above genetic distances have been calculated using all samples. We can also specify the group factor so that we can calculate genetic distances for different sample groups:

## sampleGroup is a column in the colData slot
dist_gr <- calGeneticDist(co_count = cocounts,
                          group_by  = "sampleGroup",
                          mapping_fun = "k")

Now the group/Sample specific distances are stored in rowData

rowData(dist_gr)$kosambi
#>             s1       s2
#> [1,]  7.001937  0.00000
#> [2,] 11.198036  6.70660
#> [3,] 23.647496 13.66359
#> [4,]  7.001937  0.00000
#> [5,] 11.198036  6.70660
#> [6,] 23.647496 13.66359

Plot whole genome genetic distances

We construct a GRange object from the dist_gr first.

p_gr <- granges(dist_gr)
mcols(p_gr) <- rowData(dist_gr)$kosambi

We can plot whole-genome genetic distances

plotWholeGenome(p_gr)

We can also do per chromosome plot

plotGeneticDist(p_gr,chr = "chr1",cumulative = TRUE)

Group differences

bootstrapDist function generates the sampling distribution of the difference in total genetic distances for the two groups under comparisons.

set.seed(100)
bootsDiff <- bootstrapDist(co_gr = cocounts,B=10,
                           group_by = "sampleGroup")
hist(bootsDiff)

From the bootsDiff data object, we can find a 95% confidence interval to test whether the difference is statistically significant.

quantile(bootsDiff,c(0.025,0.975),na.rm =TRUE)
#>     2.5%    97.5% 
#> -96.7663 127.3383

An alternative re-sampling method, permuteDist, can be applied to generate a null distribution for the group difference by reshuffling the group labels across the samples.

set.seed(100)
perms <- permuteDist(cocounts,B=1000,group_by = "sampleGroup")

A p-value can be calculated using the statmod::permp function (Phipson and Smyth 2010).

statmod::permp(x = sum(perms$permutes>=perms$observed_diff,
                       na.rm = TRUE),
               nperm = sum(!is.na(perms$permutes)),
               n1 = perms$nSample[1],
               n2 = perms$nSample[2])
#> [1] 0.4715562

Session info

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
#>  [3] MatrixGenerics_1.19.1       matrixStats_1.5.0          
#>  [5] BiocParallel_1.41.0         GenomicRanges_1.59.1       
#>  [7] GenomeInfoDb_1.43.2         IRanges_2.41.2             
#>  [9] S4Vectors_0.45.2            BiocGenerics_0.53.3        
#> [11] generics_0.1.3              comapr_1.11.1              
#> [13] BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3       shape_1.4.6.1            sys_3.4.3               
#>   [4] rstudioapi_0.17.1        jsonlite_1.8.9           magrittr_2.0.3          
#>   [7] GenomicFeatures_1.59.1   farver_2.1.2             rmarkdown_2.29          
#>  [10] GlobalOptions_0.1.2      BiocIO_1.17.1            vctrs_0.6.5             
#>  [13] memoise_2.0.1            Rsamtools_2.23.1         RCurl_1.98-1.16         
#>  [16] base64enc_0.1-3          htmltools_0.5.8.1        S4Arrays_1.7.1          
#>  [19] progress_1.2.3           curl_6.1.0               SparseArray_1.7.4       
#>  [22] Formula_1.2-5            sass_0.4.9               bslib_0.8.0             
#>  [25] htmlwidgets_1.6.4        plyr_1.8.9               Gviz_1.51.0             
#>  [28] httr2_1.1.0              plotly_4.10.4            cachem_1.1.0            
#>  [31] buildtools_1.0.0         GenomicAlignments_1.43.0 iterators_1.0.14        
#>  [34] lifecycle_1.0.4          pkgconfig_2.0.3          Matrix_1.7-1            
#>  [37] R6_2.5.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.13 
#>  [40] digest_0.6.37            colorspace_2.1-1         AnnotationDbi_1.69.0    
#>  [43] Hmisc_5.2-2              RSQLite_2.3.9            labeling_0.4.3          
#>  [46] filelock_1.0.3           httr_1.4.7               abind_1.4-8             
#>  [49] compiler_4.4.2           withr_3.0.2              bit64_4.6.0-1           
#>  [52] htmlTable_2.4.3          backports_1.5.0          DBI_1.2.3               
#>  [55] biomaRt_2.63.0           rappdirs_0.3.3           DelayedArray_0.33.4     
#>  [58] rjson_0.2.23             tools_4.4.2              foreign_0.8-88          
#>  [61] nnet_7.3-20              glue_1.8.0               restfulr_0.0.15         
#>  [64] grid_4.4.2               checkmate_2.3.2          reshape2_1.4.4          
#>  [67] cluster_2.1.8            gtable_0.3.6             BSgenome_1.75.0         
#>  [70] tidyr_1.3.1              ensembldb_2.31.0         data.table_1.16.4       
#>  [73] hms_1.1.3                xml2_1.3.6               XVector_0.47.2          
#>  [76] foreach_1.5.2            pillar_1.10.1            stringr_1.5.1           
#>  [79] circlize_0.4.16          dplyr_1.1.4              BiocFileCache_2.15.1    
#>  [82] lattice_0.22-6           rtracklayer_1.67.0       bit_4.5.0.1             
#>  [85] deldir_2.0-4             biovizBase_1.55.0        tidyselect_1.2.1        
#>  [88] maketools_1.3.1          Biostrings_2.75.3        knitr_1.49              
#>  [91] gridExtra_2.3            ProtGenerics_1.39.2      xfun_0.50               
#>  [94] statmod_1.5.0            stringi_1.8.4            UCSC.utils_1.3.1        
#>  [97] lazyeval_0.2.2           yaml_2.3.10              evaluate_1.0.3          
#> [100] codetools_0.2-20         interp_1.1-6             tibble_3.2.1            
#> [103] BiocManager_1.30.25      cli_3.6.3                rpart_4.1.24            
#> [106] munsell_0.5.1            jquerylib_0.1.4          dichromat_2.0-0.1       
#> [109] Rcpp_1.0.14              dbplyr_2.5.0             png_0.1-8               
#> [112] XML_3.99-0.18            parallel_4.4.2           ggplot2_3.5.1           
#> [115] blob_1.2.4               prettyunits_1.2.0        latticeExtra_0.6-30     
#> [118] jpeg_0.1-10              AnnotationFilter_1.31.0  bitops_1.0-9            
#> [121] viridisLite_0.4.2        VariantAnnotation_1.53.1 scales_1.3.0            
#> [124] purrr_1.0.2              crayon_1.5.3             rlang_1.1.5             
#> [127] KEGGREST_1.47.0
Phipson, Belinda, and Gordon K Smyth. 2010. “Permutation p-Values Should Never Be Zero: Calculating Exact p-Values When Permutations Are Randomly Drawn.” Stat. Appl. Genet. Mol. Biol. 9 (October): Article39.