scoreInvHap genotypes inversions using SNP data. This method computes a similarity score between the available SNPs of an individual and experimental references for the three inversion-genotypes; NN: non-inverted homozygous, NI: inverted heterozygous and II: inverted homozygous. Individuals are classified into the inversion-genotypes with the highest score. There are other approaches to genotype inversions from SNP data: inveRsion, invClust or PFIDO. However, these approaches have limitations including:
they are based in population sample inferences. When a new individual is included in the study the whole population sample needs to be reanalyzed. As they depend on dimensionality reduction methods, they are slow in the analysis of several individuals.
they can only handle limited number of samples. They need large samples sizes for accurate inferences.
They are highly sensitive to SNP array densities. Their accuracy depends on how clear the population sample can be partitioned, which depends on the amount and quality of informative SNPs.
They need external information to label the inversion-genotype clusters. External validation of the clustering is performed at the end of the inference. If there are more than three clusters, it is not clear how which inversion genotype should be assigned to the clusters.
scoreInvHap overcomes these difficulties by using a set of reference genotypes from the inversion of interest. The methods is a supervised classifier that genotypes each individual according to their SNP similarities to the reference genotypes across the inverted segment. The classifier in particular uses the linkage desequillibrium (R2) between the SNPs and the inversion genoptypes, and the SNPs of each reference inversion-genotypes.
The package is loaded as usual
scoreInvHap takes as input SNP data formated in either
SNPMatrix
or VCF
Bioconductor’s objects. In
the case of having data in SNPMatrix
format, a list with
two elements is required:
If data is originally available in PLINK files (.bed, .bim), we can
use the functions of snpStats
to load the data as SNPMatrix
object
library(snpStats)
## From a bed
snps <- read.plink("example.bed")
## From a pedfile
snps <- read.pedfile("example.ped", snps = "example.map")
In both cases, data is readily formatted for scoreInvHap.
If data is available in a variant call format (.vcf) file, we can
load it in R into a VCF
object, using the VariantAnnotation
package. scoreInvHap includes a small vcf in
scoreInvHap as a demo. This file contains genotyped and imputed
SNPs within inversion 7p11.2 for 30 European individuals from the 1000
Genomes project. We can load the vcf as follows
library(VariantAnnotation)
vcf_file <- system.file("extdata", "example.vcf", package = "scoreInvHap")
vcf <- readVcf(vcf_file, "hg19")
vcf
## class: CollapsedVCF
## dim: 380 30
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 4 columns: AF, MAF, R2, ER2
## info(header(vcf)):
## Number Type Description
## AF 1 Float Estimated Alternate Allele Frequency
## MAF 1 Float Estimated Minor Allele Frequency
## R2 1 Float Estimated Imputation Accuracy
## ER2 1 Float Empirical (Leave-One-Out) R-square (available only for g...
## geno(vcf):
## List of length 3: GT, DS, GP
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]
## GP 3 Float Estimated Posterior Probabilities for Genotypes 0/0, 0/1...
The object vcf
contains 380 SNPs and 30 individuals.
scoreInvHap references were created using 1000 Genomes data. Thus, SNPs are annotated to the hg19 build in the plus strand. We have included a function that checks if input SNPs correspond to scoreInvHap references:
## $genos
## class: CollapsedVCF
## dim: 307 30
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 4 columns: AF, MAF, R2, ER2
## info(header(vcf)):
## Number Type Description
## AF 1 Float Estimated Alternate Allele Frequency
## MAF 1 Float Estimated Minor Allele Frequency
## R2 1 Float Estimated Imputation Accuracy
## ER2 1 Float Empirical (Leave-One-Out) R-square (available only for g...
## geno(vcf):
## List of length 3: GT, DS, GP
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]
## GP 3 Float Estimated Posterior Probabilities for Genotypes 0/0, 0/1...
##
## $wrongAlleles
## NULL
##
## $wrongFreqs
## NULL
The function checkSNPs
checks if the alleles in the SNP
object are equal to those in scoreInvHap references. The
function also tests if the frequencies are similar in the input data and
in the references. checkSNPs
returns a list with three
elements. wrongAlleles are the SNPs with different alleles, wrongFreqs
are the SNPs with different allele frequencies and genos
is
an object with the genotype data without SNPs failing any of the
checks.
Now, we illustrate how to perform the inversion genotyping of inv7p11.2 for these data. The inversion calling requires two pieces of information:
SNPlist
),Currently, there are 20 inversion references included in scoreInvHap. We included a table with their coordinates and scoreInvHap labels at the end of this document.
inv7p11.2 is labeled as inv7_005
in
scoreInvHap. We then obtain the inversion genotypes for the 30
individuals in our dataset as follows
## scoreInvHapRes
## Samples: 30
## Genotypes' table:
## II NI NN
## 6 17 7
## - Inversion genotypes' table:
## NN NI II
## 7 17 6
## - Inversion frequency: 48.33%
This function has a parameter called BPPARAM
that allows
paralell computing using BiocParallel
infrastructure. For
example, parallel computation using 8 cores would be run by
executing
The results of scoreInvHap
are encapsulated in a object
of class scoreInvHapRes
. This object contains the
classification of the individuals into the inversion-genotypes, as well
as the similarity scores. We can retrieve results with the
classification
and the scores
functions
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## II II NI NI II NN
## Levels: II NI NN
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb
## HG00096 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742 0.4156252
## HG00097 0.7636176 0.7659578 0.9945596 0.6346131 0.7729563 0.7725742 0.4156252
## HG00099 0.6438462 0.7267291 0.7685422 0.7768338 0.9758600 0.7703418 0.5317241
## HG00100 0.3431981 0.4263283 0.4892990 0.2604141 0.4764038 0.3691768 0.6083141
## HG00101 0.9841178 0.7603894 0.7584714 0.7188997 0.6280646 0.7182172 0.3558992
## HG00102 0.2518848 0.1732271 0.3167020 0.1441663 0.1994766 0.3004813 0.5812806
## NbIa NbIb NbNb
## HG00096 0.3784514 0.5670885 0.3911247
## HG00097 0.3784514 0.5670885 0.3911247
## HG00099 0.3025523 0.5345369 0.2689791
## HG00100 0.6376241 0.6834476 0.5605652
## HG00101 0.4545927 0.4165003 0.3266407
## HG00102 0.5527802 0.5480628 0.7223922
Inversion genotypes is returned as a factor, which includes the
individual names present in the snpMatrix
or
VCF
. Thus, inversion classification can be readily used in
down-stream association tests.
We can retrieve other values that are useful to evaluate the quality of the inference. For each individual, we can obtain the highest similarity score and the difference between the highest similarity score and the second highest:
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 0.9945596 0.9945596 0.9758600 0.6834476 0.9841178 0.7223922
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 0.2216032 0.2216032 0.1990263 0.0458235 0.2237283 0.1411116
Classification is good when the highest score is close to 1 and the
other scores are small. This means that the SNPs in the individual are
almost identical to one of the reference genotypes and different from
the rest. We can use the difference between the highest score and the
second highest score as a quality measure. We can have a visual
evaluation of the quality parameters with the function
plotScores
:
The horizontal line is a threshold for quality, set to 0.1 but can be
changed in the parameter minDiff
in the function
scoreInvHap
. This default value considers that the SNPs of
the individual are at least 10% more similar to the selected reference
than to second one. plotScores
can be customized.
Another quality measure is based on the number of SNPs used in the computation. scoreInvHap allows individuals having different SNP coverages within the inverted regions. SNPs with no information are excluded from the computation of the scores. To reflect this we report a SNP call rate:
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 307 307 307 307 307 307
## HG00096 HG00097 HG00099 HG00100 HG00101 HG00102
## 1 1 1 1 1 1
The number of SNPs must always be taken into account to evaluate the
performance of the computation. It is highly recommended to use, at
least, 15 SNPs in any inversion-calling. plotCallRate
can
be used to visualize the call rate
The vertical line is the minimum recommended call rate to consider a
sample. By default, it is set to 0.9 but can be changed with the
parameter callRate
.
The function classification
have two parameters that
selects samples based on these two QC parameters. The argument
minDiff
sets the minimum difference between the highest and
the second highest score. The argument callRate
sets the
minimum call rate of a sample to pass the QC. By default, both arguments
are set to 0 so all the samples are included:
## [1] 30
## [1] 27
Finally, the function classification
has the argument
inversion
that, when it is set to FALSE,
haplotype-genotypes are called instead of inversion-genotuypes. This is
useful for inversions that have more than one haplotype per inversion
status:
##
## II NI NN
## 6 17 7
##
## IaIa IaIb IbIb NaIa NaIb NaNa NaNb NbIa NbIb NbNb
## 2 0 4 5 5 0 4 3 4 3
When SNPs data are imputed, we obtain three different types of
results: the best-guess, the dosage and the posterior probabilities. By
default, scoreInvHap
uses the best-guess when computing the
similarity scores. However, we can also use posterior probabilities
setting the argument probs
to TRUE:
## scoreInvHapRes
## Samples: 30
## Genotypes' table:
## II NI NN
## 6 17 7
## - Inversion genotypes' table:
## NN NI II
## 7 17 6
## - Inversion frequency: 48.33%
In this case, the samples were identically classified in both cases:
## BestGuess
## PostProbs II NI NN
## II 6 0 0
## NI 0 17 0
## NN 0 0 7
There are two additional parameters of scoreInvHap
than
can reduce computing time: R2
and BPPARAM
.
R2
indicates the minimum R2 that a SNP should
have with the inversion to be included in the score. The less number of
SNPs the less time it takes. By default, all SNPs are included in the
computation. On the other hand, BPPARAM
requires an
instance of BiocParallelParam
, which allows to parallelize
the computation of the score. You can find more information about this
class in its help page (?bpparam
) and in the BiocParallel
vignette.
The following table describes the inversion includes in
scoreInvHap
:
scoreInvHap Label | Locus | Length (Kb) | Num. Haplos |
---|---|---|---|
inv1_004 | 1p22.1 | 0.77 | 2 |
inv1_008 | 1q31.1 | 1.2 | 2 |
inv2_002 | 2p22.3 | 0.72 | 2 |
inv2_013 | 2q22.1 | 4.25 | 2 |
inv3_003 | 3q26.1 | 2.28 | 4 (3/1) |
inv6_002 | 6p21.33 | 0.87 | 2 |
inv6_006 | 6q23.1 | 4.13 | 2 |
inv7_003 | 7p14.3 | 5.25 | 2 |
inv7_005 | 7p11.2 | 73.9 | 4 (2/2) |
inv7_011 | 7q11.22 | 12.7 | 2 |
inv7_014 | 7q36.1 | 2.08 | 2 |
inv8_001 | 8p23.1 | 3925 | 2 |
inv11_001 | 11p12 | 4.75 | 2 |
inv11_004 | 11q13.2 | 1.38 | 3 (2/1) |
inv12_004 | 12q21.2 | 19.3 | 2 |
inv12_006 | 12q21.2 | 1.03 | 3 (2/1) |
inv14_005 | 14q23.3 | 0.86 | 2 |
inv16_009 | 16p11.2 | 364.2 | 2 |
inv17_007 | 17q21.31 | 711 | 2 |
inv21_005 | 21q21.3 | 1.06 | 4 (3/1) |
invX_006 | Xq13.2 | 90.8 | 4 (3/1) |
This information is also available in inversionGR
:
## GRanges object with 22 ranges and 5 metadata columns:
## seqnames ranges strand | scoreInvHap.name
## <Rle> <IRanges> <Rle> | <character>
## inv1_004 chr1 92131841-92132615 * | inv1_004
## inv1_008 chr1 197756784-197757982 * | inv1_008
## inv2_002 chr2 33764554-33765272 * | inv2_002
## inv2_013 chr2 139004949-139009203 * | inv2_013
## inv3_003 chr3 162545362-162547641 * | inv3_003
## ... ... ... ... . ...
## inv17_007 chr17 43661775-44372665 * | inv17_007
## inv21_005 chr21 28020653-28021711 * | inv21_005
## invX_006 chrX 72215927-72306774 * | invX_006
## inv16_009 chr16 28424774-28788943 * | inv16_009
## inv10_005 chr10 27220925-27656433 * | inv10_005
## Cytogenetic.location Inv.freq Haplotypes Num.SNPs
## <character> <numeric> <numeric> <numeric>
## inv1_004 1p22.1 11.23 2 6
## inv1_008 1q31.3 19.68 2 5
## inv2_002 2p22.3 15.11 2 6
## inv2_013 2q22.1 71.47 2 13
## inv3_003 3q26.1 56.16 4 6
## ... ... ... ... ...
## inv17_007 17q21.31 23.96 2 3637
## inv21_005 21q21.3 51.29 4 11
## invX_006 Xq13.2 13.30 4 135
## inv16_009 16p11.2 34.49 2 361
## inv10_005 <NA> NA NA NA
## -------
## seqinfo: 14 sequences from an unspecified genome; no seqlengths
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] snpStats_1.56.0 Matrix_1.7-1
## [3] survival_3.7-0 VariantAnnotation_1.52.0
## [5] Rsamtools_2.22.0 Biostrings_2.75.0
## [7] XVector_0.46.0 SummarizedExperiment_1.36.0
## [9] Biobase_2.67.0 GenomicRanges_1.59.0
## [11] GenomeInfoDb_1.43.0 IRanges_2.41.0
## [13] S4Vectors_0.44.0 MatrixGenerics_1.19.0
## [15] matrixStats_1.4.1 BiocGenerics_0.53.0
## [17] scoreInvHap_1.29.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 rjson_0.2.23 xfun_0.48
## [4] bslib_0.8.0 lattice_0.22-6 vctrs_0.6.5
## [7] tools_4.4.1 bitops_1.0-9 curl_5.2.3
## [10] parallel_4.4.1 AnnotationDbi_1.69.0 RSQLite_2.3.7
## [13] highr_0.11 blob_1.2.4 BSgenome_1.75.0
## [16] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.1
## [19] codetools_0.2-20 htmltools_0.5.8.1 sys_3.4.3
## [22] buildtools_1.0.0 sass_0.4.9 RCurl_1.98-1.16
## [25] yaml_2.3.10 crayon_1.5.3 jquerylib_0.1.4
## [28] BiocParallel_1.41.0 DelayedArray_0.33.1 cachem_1.1.0
## [31] abind_1.4-8 digest_0.6.37 restfulr_0.0.15
## [34] splines_4.4.1 maketools_1.3.1 fastmap_1.2.0
## [37] grid_4.4.1 cli_3.6.3 SparseArray_1.6.0
## [40] S4Arrays_1.6.0 GenomicFeatures_1.59.0 XML_3.99-0.17
## [43] UCSC.utils_1.2.0 bit64_4.5.2 rmarkdown_2.28
## [46] httr_1.4.7 bit_4.5.0 png_0.1-8
## [49] memoise_2.0.1 evaluate_1.0.1 knitr_1.48
## [52] BiocIO_1.17.0 rtracklayer_1.66.0 rlang_1.1.4
## [55] DBI_1.2.3 BiocManager_1.30.25 jsonlite_1.8.9
## [58] R6_2.5.1 GenomicAlignments_1.43.0 zlibbioc_1.52.0