---
title: "Accessing data from the IGVF Catalog"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Accessing data from the IGVF Catalog}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# IGVF background

The [Impact of Genomic Variation on Function (IGVF)](https://igvf.org/) 
Consortium,

> aims to understand how genomic variation affects genome function, which in
> turn impacts phenotype. The NHGRI is funding this collaborative program that
> brings together teams of investigators who will use state-of-the-art
> experimental and computational approaches to model, predict, characterize and
> map genome function, how genome function shapes phenotype, and how these
> processes are affected by genomic variation. These joint efforts will produce
> a catalog of the impact of genomic variants on genome function and phenotypes. 

The *IGVF Catalog* described in the last sentence is available through a number
of interfaces, including a [web interface](https://catalog.igvf.org/) as well as 
two programmatic interfaces. In addition, there is a *Data Portal*, where raw
and processed data can be downloaded, with its own
[web interface](https://data.igvf.org/) and 
[client library](https://github.com/IGVF-DACC/igvf-r-client) for R. 
This package, `rigvf`, focuses on the *Catalog* and not the *Data Portal*.

## Knowledge graph of variant function

The *IGVF Catalog* is a form of *knowledge graph*, where the nodes are
biological entities such as variants, genes, pathways, etc. and edges are
relationships between such nodes, e.g. empirically measured effects of variants
on *cis*-regulatory elements (CREs) or on transcripts and proteins. These edges
may have metadata including information about cell type context and information
about how the association was measured, e.g. which experiment or predictive
model.

## The *rigvf* package

This package provides access to some of the IGVF Catalog from within 
R/Bioconductor. Currently, limited functionality is implemented, 
with priority towards applications and integration with Bioconductor 
functionality. However, the [source code](https://github.com/IGVF/rigvf)
and internal functions can also be used to build one-off querying 
functions.

## Data license

For IGVF Catalog data license, see the [IGVF policy page](https://data.igvf.org/policies/).

# Catalog API

The IGVF Catalog offers two programmatic interfaces, the Catalog API and 
an ArangoDb interface, described further below. The [Catalog API][]
is preferred, with optimized queries of relevant information. 
Queries are simple REST requests implemented using the *httr2* package. 
Here we query variants associated with "GCK"; one could also use, e.g., 
Ensembl identifiers.

[Catalog API]: https://api.catalog.igvf.org/#

```{r}
library(rigvf)
gene_variants(gene_name = "GCK")
```

Note that we only receive a limited number of responses. We can change both
the `page` of responses we receive and the `limit` per page:

```{r}
gene_variants(gene_name = "GCK", page=1L)
gene_variants(gene_name = "GCK", limit=50L)
```

We can also pass thresholds on the negative log10 p-value or the effect size, 
using the following letter combinations, `gt (>), gte (>=), lt (<), lte (<=)`, 
followed by a `:` and a value. See examples below:

```{r}
gene_variants(gene_name = "GCK", log10pvalue="gt:5.0")
gene_variants(gene_name = "GCK", effect_size="gt:0.5")
```

The help page `?catalog_queries` outlines other available user-facing
functions. 

Note that nested columns in the response can be widen-ed using *tidyr*.
For example, when querying genomic elements that are associated with a gene,
we get back nested output:

```{r}
res <- gene_elements(gene_id = "ENSG00000187961", verbose = TRUE)
res

res |>
    dplyr::select(`regions`) |>
    tidyr::unnest_wider(`regions`)
```

## GRanges-based queries

The functions `elements()` and `element_genes()` take *GRanges* input ranges, 
and `elements()` provides back the response as *GRanges*. Note that IGVF uses 
hg38 and USCS-style chromosome names, e.g., chr1.

# ArangoDB API

The ArangoDB API provides flexibility but requires greater
understanding of Arango Query Language and the database
schema. Documentation is available in the [database][] under the
'Support' menu item 'REST API' tab using username 'guest' and password
'guestigvfcatalog'.

[database]: https://db.catalog.igvf.org/_db/igvf/_admin/aardvark/index.html#support

The following directly queries the database for variants of an Ensembl
gene id.

```{r}
db_gene_variants("ENSG00000106633", threshold = 0.85)
```

The AQL is

```{r}
aql <- system.file(package = "rigvf", "aql", "gene_variants.aql")
readLines(aql) |> noquote()
```

The help page `?db_queries` outlines other available user-facing
functions. See `?arango` for more developer-oriented information.

# Use with Bioconductor

## Compute overlap with *plyranges*

We can query genomic elements using `elements()` and then compute 
overlaps with the *plyranges* package as below. See the 
[plyranges](https://github.com/tidyomics/plyranges) package documentation 
for more examples.

```{r message=FALSE}
library(plyranges)
rng <- data.frame(seqnames="chr1", start=10e6+1, end=10.1e6) |>
  as_granges()

e <- elements(rng, limit=200L) |>
  plyranges::filter(source != "ENCODE_EpiRaction")
e
```

```{r}
tiles <- tile_ranges(rng, width=10e3) %>%
  plyranges::select(-partition) %>%
  plyranges::mutate(id = letters[seq_along(.)])

# count overlaps of central basepair of elements in tiles
e |>
  plyranges::anchor_center() |>
  plyranges::mutate(width = 1) |>
  plyranges::join_overlap_left(tiles) |>
  tibble::as_tibble() |>
  dplyr::count(id)
```

## Plot variants with *plotgardener*

Below we show a simple example of plotting variants and 
elements in a gene context, using the *plotgardener* package.

First selecting some variants around the gene *GCK* (reminder that we
only obtain the first `limit` number of variants, see ?catalog_queries 
for more details).

```{r}
# up to 200 variants:
v <- gene_variants(gene_name = "GCK", limit=200L, verbose=TRUE) |>
  dplyr::select(-c(gene, chr, source, source_url)) |>
  tidyr::unnest_wider(`sequence variant`) |>
  dplyr::rename(seqnames = chr) |>
  dplyr::mutate(start = pos + 1, end = pos + 1) |>
  as_granges()
```

We then load *plotgardener* and define some default parameters.

```{r message=FALSE}
library(plotgardener)
par <- pgParams(
  chrom = "chr7",
  chromstart = 44.1e6,
  chromend = 44.25e6,
  assembly = "hg38",
  just = c("left", "bottom")
)
```

Renaming some columns:

```{r}
v_for_plot <- v |>
  plyranges::select(snp = rsid, p = log10pvalue, effect_size)
```

To match with the correct gene annotation, we could explicitly define 
the transcript database using `plotgardener::assembly()`, where we would
provide a database built by running `GenomicFeatures::makeTxDbFromGFF()` 
on a GENCODE GTF file. Here we use one of the standard hg38 TxDb with
UCSC-style chromosome names.

```{r}
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(org.Hs.eg.db)
```

Below we build the page and populate it with annotation from the *TxDb*
used by *plotgardener* and from IGVF Catalog.

```{r plotgardener, fig.alt="Manhattan of IGVF annotated variants", fig.dim = c(5, 4)}
pageCreate(width = 5, height = 4, showGuides = FALSE)
plotGenes(
  params = par, x = 0.5, y = 3.5, width = 4, height = 1
)
plotGenomeLabel(
  params = par,
  x = 0.5, y = 3.5, length = 4,
  just = c("left", "top")
)
mplot <- plotManhattan(
  params = par, x = 0.5, y = 2.5, width = 4, height = 2,
  v_for_plot, trans = "",
  sigVal = -log10(5e-8), sigLine = TRUE, col = "grey", lty = 2
)
annoYaxis(
    plot = mplot, at=0:4 * 4, axisLine = TRUE, fontsize = 8
)
annoXaxis(
    plot = mplot, axisLine = TRUE, label = FALSE
)
plotText(
    params = par,
    label = "-log10(p-value)", x = 0.2, y = 2, rot = 90,
    fontsize = 8, fontface = "bold",
    default.units = "inches"
)
```

# Session info

```{r}
sessionInfo()
```