--- title: "Accessing data from the IGVF Catalog" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Accessing data from the IGVF Catalog} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # IGVF background The [Impact of Genomic Variation on Function (IGVF)](https://igvf.org/) Consortium, > aims to understand how genomic variation affects genome function, which in > turn impacts phenotype. The NHGRI is funding this collaborative program that > brings together teams of investigators who will use state-of-the-art > experimental and computational approaches to model, predict, characterize and > map genome function, how genome function shapes phenotype, and how these > processes are affected by genomic variation. These joint efforts will produce > a catalog of the impact of genomic variants on genome function and phenotypes. The *IGVF Catalog* described in the last sentence is available through a number of interfaces, including a [web interface](https://catalog.igvf.org/) as well as two programmatic interfaces. In addition, there is a *Data Portal*, where raw and processed data can be downloaded, with its own [web interface](https://data.igvf.org/) and [client library](https://github.com/IGVF-DACC/igvf-r-client) for R. This package, `rigvf`, focuses on the *Catalog* and not the *Data Portal*. ## Knowledge graph of variant function The *IGVF Catalog* is a form of *knowledge graph*, where the nodes are biological entities such as variants, genes, pathways, etc. and edges are relationships between such nodes, e.g. empirically measured effects of variants on *cis*-regulatory elements (CREs) or on transcripts and proteins. These edges may have metadata including information about cell type context and information about how the association was measured, e.g. which experiment or predictive model. ## The *rigvf* package This package provides access to some of the IGVF Catalog from within R/Bioconductor. Currently, limited functionality is implemented, with priority towards applications and integration with Bioconductor functionality. However, the [source code](https://github.com/IGVF/rigvf) and internal functions can also be used to build one-off querying functions. ## Data license For IGVF Catalog data license, see the [IGVF policy page](https://data.igvf.org/policies/). # Catalog API The IGVF Catalog offers two programmatic interfaces, the Catalog API and an ArangoDb interface, described further below. The [Catalog API][] is preferred, with optimized queries of relevant information. Queries are simple REST requests implemented using the *httr2* package. Here we query variants associated with "GCK"; one could also use, e.g., Ensembl identifiers. [Catalog API]: https://api.catalog.igvf.org/# ```{r} library(rigvf) gene_variants(gene_name = "GCK") ``` Note that we only receive a limited number of responses. We can change both the `page` of responses we receive and the `limit` per page: ```{r} gene_variants(gene_name = "GCK", page=1L) gene_variants(gene_name = "GCK", limit=50L) ``` We can also pass thresholds on the negative log10 p-value or the effect size, using the following letter combinations, `gt (>), gte (>=), lt (<), lte (<=)`, followed by a `:` and a value. See examples below: ```{r} gene_variants(gene_name = "GCK", log10pvalue="gt:5.0") gene_variants(gene_name = "GCK", effect_size="gt:0.5") ``` The help page `?catalog_queries` outlines other available user-facing functions. Note that nested columns in the response can be widen-ed using *tidyr*. For example, when querying genomic elements that are associated with a gene, we get back nested output: ```{r} res <- gene_elements(gene_id = "ENSG00000187961", verbose = TRUE) res res |> dplyr::select(`regions`) |> tidyr::unnest_wider(`regions`) ``` ## GRanges-based queries The functions `elements()` and `element_genes()` take *GRanges* input ranges, and `elements()` provides back the response as *GRanges*. Note that IGVF uses hg38 and USCS-style chromosome names, e.g., chr1. # ArangoDB API The ArangoDB API provides flexibility but requires greater understanding of Arango Query Language and the database schema. Documentation is available in the [database][] under the 'Support' menu item 'REST API' tab using username 'guest' and password 'guestigvfcatalog'. [database]: https://db.catalog.igvf.org/_db/igvf/_admin/aardvark/index.html#support The following directly queries the database for variants of an Ensembl gene id. ```{r} db_gene_variants("ENSG00000106633", threshold = 0.85) ``` The AQL is ```{r} aql <- system.file(package = "rigvf", "aql", "gene_variants.aql") readLines(aql) |> noquote() ``` The help page `?db_queries` outlines other available user-facing functions. See `?arango` for more developer-oriented information. # Use with Bioconductor ## Compute overlap with *plyranges* We can query genomic elements using `elements()` and then compute overlaps with the *plyranges* package as below. See the [plyranges](https://github.com/tidyomics/plyranges) package documentation for more examples. ```{r message=FALSE} library(plyranges) rng <- data.frame(seqnames="chr1", start=10e6+1, end=10.1e6) |> as_granges() e <- elements(rng, limit=200L) |> plyranges::filter(source != "ENCODE_EpiRaction") e ``` ```{r} tiles <- tile_ranges(rng, width=10e3) %>% plyranges::select(-partition) %>% plyranges::mutate(id = letters[seq_along(.)]) # count overlaps of central basepair of elements in tiles e |> plyranges::anchor_center() |> plyranges::mutate(width = 1) |> plyranges::join_overlap_left(tiles) |> tibble::as_tibble() |> dplyr::count(id) ``` ## Plot variants with *plotgardener* Below we show a simple example of plotting variants and elements in a gene context, using the *plotgardener* package. First selecting some variants around the gene *GCK* (reminder that we only obtain the first `limit` number of variants, see ?catalog_queries for more details). ```{r} # up to 200 variants: v <- gene_variants(gene_name = "GCK", limit=200L, verbose=TRUE) |> dplyr::select(-c(gene, chr, source, source_url)) |> tidyr::unnest_wider(`sequence variant`) |> dplyr::rename(seqnames = chr) |> dplyr::mutate(start = pos + 1, end = pos + 1) |> as_granges() ``` We then load *plotgardener* and define some default parameters. ```{r message=FALSE} library(plotgardener) par <- pgParams( chrom = "chr7", chromstart = 44.1e6, chromend = 44.25e6, assembly = "hg38", just = c("left", "bottom") ) ``` Renaming some columns: ```{r} v_for_plot <- v |> plyranges::select(snp = rsid, p = log10pvalue, effect_size) ``` To match with the correct gene annotation, we could explicitly define the transcript database using `plotgardener::assembly()`, where we would provide a database built by running `GenomicFeatures::makeTxDbFromGFF()` on a GENCODE GTF file. Here we use one of the standard hg38 TxDb with UCSC-style chromosome names. ```{r} library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) ``` Below we build the page and populate it with annotation from the *TxDb* used by *plotgardener* and from IGVF Catalog. ```{r plotgardener, fig.alt="Manhattan of IGVF annotated variants", fig.dim = c(5, 4)} pageCreate(width = 5, height = 4, showGuides = FALSE) plotGenes( params = par, x = 0.5, y = 3.5, width = 4, height = 1 ) plotGenomeLabel( params = par, x = 0.5, y = 3.5, length = 4, just = c("left", "top") ) mplot <- plotManhattan( params = par, x = 0.5, y = 2.5, width = 4, height = 2, v_for_plot, trans = "", sigVal = -log10(5e-8), sigLine = TRUE, col = "grey", lty = 2 ) annoYaxis( plot = mplot, at=0:4 * 4, axisLine = TRUE, fontsize = 8 ) annoXaxis( plot = mplot, axisLine = TRUE, label = FALSE ) plotText( params = par, label = "-log10(p-value)", x = 0.2, y = 2, rot = 90, fontsize = 8, fontface = "bold", default.units = "inches" ) ``` # Session info ```{r} sessionInfo() ```