Original version: 1 May, 2024
ClinVar is a freely available, public archive of human genetic variants that provides clinical classifications for whether a variant is likely benign or pathogenic. The AlphaMissense publication uses the ClinVar data to evaluate and calibrate the predictions generated by their model. A table containing ClinVar information for 82872 variants across 7951 proteins was derived from the supplemental data of the AlphaMissense paper, and is made available through this package for benchmarking and visualization purposes.
The ClinVar table can be accessed from the database using clinvar_data().
clinvar_data()
#> * [03:43:24][info] downloading or finding local file
#> * [03:43:24][info] creating database table 'clinvar'
#> * [03:43:24][info] disconnecting all registered connections
#> # Source: table<clinvar> [?? x 5]
#> # Database: DuckDB v1.1.1 [unknown@Linux 6.5.0-1025-azure:R 4.4.1//github/home/.cache/R/BiocFileCache/1e4e4a1eb813_1e4e4a1eb813]
#> variant_id transcript_id protein_variant AlphaMissense label
#> <chr> <chr> <chr> <dbl> <fct>
#> 1 chr1_925969_C_T_hg38 ENST00000342066.8 Q96NU1:P10S 0.967 benign
#> 2 chr1_930165_G_A_hg38 ENST00000342066.8 Q96NU1:R28Q 0.663 benign
#> 3 chr1_930204_G_A_hg38 ENST00000342066.8 Q96NU1:R41Q 0.0866 benign
#> 4 chr1_930245_G_A_hg38 ENST00000342066.8 Q96NU1:D55N 0.134 benign
#> 5 chr1_930248_G_A_hg38 ENST00000342066.8 Q96NU1:G56S 0.100 benign
#> 6 chr1_930282_G_A_hg38 ENST00000342066.8 Q96NU1:R67Q 0.0635 benign
#> 7 chr1_930285_G_A_hg38 ENST00000342066.8 Q96NU1:R68Q 0.0629 benign
#> 8 chr1_930314_C_T_hg38 ENST00000342066.8 Q96NU1:H78Y 0.110 benign
#> 9 chr1_930320_C_T_hg38 ENST00000342066.8 Q96NU1:R80C 0.0918 benign
#> 10 chr1_931058_G_A_hg38 ENST00000342066.8 Q96NU1:V92M 0.196 benign
#> # ℹ more rows
The ClinVar table is now available for exploration or parsing.
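As a minimal sketch of what such exploration could look like, the example below counts variants per ClinVar label and lists the highest-scoring variants with dplyr. So that it runs without a database connection, it works on a small in-memory data frame (the hypothetical name clinvar_head) copied from the first rows printed above; substituting clinvar_data() for clinvar_head is assumed to work the same way, since the column names are identical.

```r
library(dplyr)

# Stand-in for clinvar_data(): the first four rows printed above,
# held in memory so the example runs without the database.
clinvar_head <- data.frame(
    variant_id = c(
        "chr1_925969_C_T_hg38", "chr1_930165_G_A_hg38",
        "chr1_930204_G_A_hg38", "chr1_930245_G_A_hg38"
    ),
    transcript_id = "ENST00000342066.8",
    protein_variant = c(
        "Q96NU1:P10S", "Q96NU1:R28Q", "Q96NU1:R41Q", "Q96NU1:D55N"
    ),
    AlphaMissense = c(0.967, 0.663, 0.0866, 0.134),
    label = factor("benign")
)

# Number of variants per ClinVar label
label_counts <- clinvar_head |>
    count(label)

# Variants with an AlphaMissense score above 0.5, highest first
high_scoring <- clinvar_head |>
    filter(AlphaMissense > 0.5) |>
    arrange(desc(AlphaMissense)) |>
    select(protein_variant, AlphaMissense, label)

print(label_counts)
print(high_scoring)
```

On the database-backed table, dplyr verbs like these are translated to SQL and evaluated lazily; adding collect() at the end of a pipeline brings the result into memory.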
This section uses the clinvar_plot() function to generate a scatterplot that benchmarks ClinVar classifications against AlphaMissense predictions. By default, the function takes one UniProt accession identifier, derives AlphaMissense scores from am_data("aa_substitution"), and pulls ClinVar classifications from the table obtained above. Alternatively, a custom AlphaMissense or ClinVar table can be passed to the function through the alphamissense_table and clinvar_table arguments. See the function documentation for details.
clinvar_plot(uniprotId = "P37023")
#> * [03:43:25][info] 'alphamissense_table' not provided, using default 'am_data("aa_substitution")' table accessed through the AlphaMissenseR package
#> * [03:43:25][info] 'clinvar_table' not provided, using default ClinVar dataset in AlphaMissenseR package
The function returns a ggplot object that overlays ClinVar classifications onto AlphaMissense predicted scores. Blue, gray, and red represent the pathogenicity classes “likely benign”, “ambiguous”, and “likely pathogenic”, respectively. Large, bold points are ClinVar variants colored according to their clinical classification, while the smaller points in the background are AlphaMissense predictions.
The plot shows a discrepancy between the clinically validated annotations and the AlphaMissense predictions around position 50: AlphaMissense predicts several variants in that region as likely benign, while ClinVar classifies them as pathogenic.
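Discrepancies of this kind can also be tabulated rather than read off a plot. The sketch below is an illustration under stated assumptions, not the method used by clinvar_plot(): it bins AlphaMissense scores with the cutoffs reported in the AlphaMissense publication (below 0.34 likely benign, above 0.564 likely pathogenic, otherwise ambiguous) and cross-tabulates the result against the ClinVar label. The helper am_class_from_score() and the data frame clinvar_head are hypothetical names, and the rows are copied from the Q96NU1 output printed earlier rather than taken from P37023.

```r
# Hypothetical helper: map AlphaMissense scores to the three classes,
# assuming the cutoffs reported in the AlphaMissense publication.
am_class_from_score <- function(score) {
    ifelse(
        score < 0.34, "likely_benign",
        ifelse(score > 0.564, "likely_pathogenic", "ambiguous")
    )
}

# Stand-in data: four rows copied from the clinvar_data() output above.
clinvar_head <- data.frame(
    protein_variant = c(
        "Q96NU1:P10S", "Q96NU1:R28Q", "Q96NU1:R41Q", "Q96NU1:D55N"
    ),
    AlphaMissense = c(0.967, 0.663, 0.0866, 0.134),
    label = "benign"
)

clinvar_head$am_class <- am_class_from_score(clinvar_head$AlphaMissense)

# Cross-tabulate ClinVar label against the derived AlphaMissense class
agreement <- table(
    clinvar = clinvar_head$label,
    alphamissense = clinvar_head$am_class
)
print(agreement)

# Variants that ClinVar labels benign but that score as likely pathogenic
discordant <- subset(
    clinvar_head,
    label == "benign" & am_class == "likely_pathogenic"
)
print(discordant$protein_variant)
```

The same comparison applies in the opposite direction (ClinVar pathogenic, AlphaMissense likely benign), which is the pattern described for the region around position 50.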
Because the ClinVar dataset is not exhaustive (not all proteins have been clinically assessed), ClinVar information may be unavailable for some proteins. In that case, the function raises an error.
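When plotting many proteins in a loop, it may be convenient to catch that error instead of stopping. The following is a sketch only: try_clinvar_plot() is a hypothetical wrapper, not part of the package, and the accession "X00000" and the stub function are placeholders that let the example run without AlphaMissenseR installed.

```r
# Hypothetical wrapper: call a plotting function for one accession and
# return NULL (with a message) if it signals an error.
try_clinvar_plot <- function(uniprot_id, plot_fun) {
    tryCatch(
        plot_fun(uniprotId = uniprot_id),
        error = function(e) {
            message("No plot for ", uniprot_id, ": ", conditionMessage(e))
            NULL
        }
    )
}

# Stand-in that always fails, mimicking a protein without ClinVar data.
# The error text is invented for this example.
no_clinvar_stub <- function(uniprotId) {
    stop("no ClinVar information found")
}

result <- try_clinvar_plot("X00000", no_clinvar_stub)
print(is.null(result))
```

With the package attached, the wrapper would be called as try_clinvar_plot("P37023", clinvar_plot), returning the ggplot object on success.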
Remember to disconnect from the database.
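A minimal way to do this, assuming the db_disconnect_all() function exported by AlphaMissenseR (consistent with the "disconnecting all registered connections" message in the output above), is shown below. The requireNamespace() guard is an addition for this sketch so that the snippet does nothing when the package is not installed.

```r
# Close every database connection registered by AlphaMissenseR in this
# session. The guard is an assumption-driven convenience, not a requirement.
if (requireNamespace("AlphaMissenseR", quietly = TRUE)) {
    AlphaMissenseR::db_disconnect_all()
}
```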
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gghalves_0.1.4 ggplot2_3.5.1 ggdist_3.3.2
#> [4] tidyr_1.3.1 ExperimentHub_2.13.1 AnnotationHub_3.13.3
#> [7] BiocFileCache_2.13.2 dbplyr_2.5.0 BiocGenerics_0.51.3
#> [10] AlphaMissenseR_1.3.0 dplyr_1.1.4 rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2
#> [4] blob_1.2.4 filelock_1.0.3 Biostrings_2.73.2
#> [7] fastmap_1.2.0 duckdb_1.1.1 promises_1.3.0
#> [10] digest_0.6.37 mime_0.12 lifecycle_1.0.4
#> [13] r3dmol_0.1.2 KEGGREST_1.45.1 RSQLite_2.3.7
#> [16] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
#> [19] sass_0.4.9 tools_4.4.1 utf8_1.2.4
#> [22] yaml_2.3.10 knitr_1.48 labeling_0.4.3
#> [25] htmlwidgets_1.6.4 bit_4.5.0 spdl_0.0.5
#> [28] curl_5.2.3 withr_3.0.2 purrr_1.0.2
#> [31] sys_3.4.3 grid_4.4.1 stats4_4.4.1
#> [34] fansi_1.0.6 xtable_1.8-4 colorspace_2.1-1
#> [37] scales_1.3.0 cli_3.6.3 crayon_1.5.3
#> [40] generics_0.1.3 httr_1.4.7 BiocBaseUtils_1.7.3
#> [43] DBI_1.2.3 cachem_1.1.0 zlibbioc_1.51.2
#> [46] parallel_4.4.1 AnnotationDbi_1.67.0 BiocManager_1.30.25
#> [49] XVector_0.45.0 vctrs_0.6.5 jsonlite_1.8.9
#> [52] IRanges_2.39.2 S4Vectors_0.43.2 bit64_4.5.2
#> [55] maketools_1.3.1 jquerylib_0.1.4 bio3d_2.4-5
#> [58] glue_1.8.0 distributional_0.5.0 gtable_0.3.6
#> [61] BiocVersion_3.20.0 later_1.3.2 GenomeInfoDb_1.41.2
#> [64] GenomicRanges_1.57.2 UCSC.utils_1.1.0 munsell_0.5.1
#> [67] tibble_3.2.1 pillar_1.9.0 rappdirs_0.3.3
#> [70] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13 R6_2.5.1
#> [73] shiny.gosling_1.1.0 evaluate_1.0.1 shiny_1.9.1
#> [76] Biobase_2.65.1 highr_0.11 png_0.1-8
#> [79] memoise_2.0.1 httpuv_1.6.15 bslib_0.8.0
#> [82] RcppSpdlog_0.0.18 rjsoncons_1.3.1 Rcpp_1.0.13
#> [85] whisker_0.4.1 xfun_0.48 buildtools_1.0.0
#> [88] pkgconfig_2.0.3