---
title: "Introduction_3_loadfrequency"
author: "Hangjia Zhao"
date: "`r Sys.Date()`"
output: 
  BiocStyle::html_document:
  toc: true
vignette: >
  %\VignetteIndexEntry{Introduction_3_loadfrequency}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

[Progenetix](https://progenetix.org/) is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette offers a comprehensive guide on accessing and visualizing CNV frequency data within the Progenetix database. CNV frequency is pre-calculated based on CNV segment data in Progenetix and reflects the CNV pattern in a cohort. It is defined as the percentage of samples showing a CNV for a genomic region (1MB-sized genomic bins in this case) over the total number of samples in a cohort specified by filters. 

If your focus lies in cancer cell lines, you can access data from [*cancercelllines.org*](https://cancercelllines.org/) by setting the `domain` parameter to "https://cancercelllines.org" in `pgxLoader` function. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.


# Load library 

```{r setup}
library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data
library(GenomicRanges) # for pgxfreq data
```

## `pgxLoader` function 

This function loads various data from `Progenetix` database via the Beacon v2 API with some extensions (BeaconPlus).  

The parameters of this function used in this tutorial:

* `type`: A string specifying output data type. Only "cnv_frequency" is used in this tutorial.
* `output`: A string specifying output file format. When the parameter `type` is "cnv_frequency", available options are "pgxfreq" or "pgxmatrix".
* `filters`: Identifiers used in public repositories, bio-ontology terms, or custom terms such as c("NCIT:C7376", "PMID<!-- -->:22824167"). When multiple filters are used, they are combined using OR logic when the parameter `type` is "cnv_frequency". For more information about filters, see the [documentation](https://docs.progenetix.org/common/beacon-api/#filters-filters-filtering-terms).
* `dataset`: A string specifying the dataset to query from the Beacon response. Default is NULL, which includes results from all datasets.

# Retrieve CNV frequency

## The first output format (`output` = "pgxfreq")

```{r}
freq_pgxfreq <- pgxLoader(type="cnv_frequency", output ="pgxfreq",
                         filters=c("NCIT:C4038","pgx:icdom-85003"))

freq_pgxfreq
```

The returned data is stored in [*GRangesList*](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) container which consists of multiple *GRanges* objects.
Each *GRanges* object stores CNV frequency from samples pecified by a particular filter. 
Within each GRanges object, you can find annotation columns "gain_frequency" and "loss_frequency" in each row, which express the percentage values across samples (%) for gains and losses that overlap the corresponding genomic interval.

These genomic intervals are derived from the partitioning of the entire genome (GRCh38). Most of these bins have a size of 1MB, except for a few bins located near the telomeres. In total, there are 3106 intervals encompassing the genome.

To access the CNV frequency data from specific filters, you could access like this

```{r}
freq_pgxfreq[["NCIT:C4038"]]
```

To get metadata such as count of samples used to calculate frequency, use `mcols` function from *GenomicRanges* package:

```{r}
mcols(freq_pgxfreq)
```

## The second output format (`output` = "pgxmatrix")

Choose 8 NCIt codes of interests that correspond to different tumor types

```{r}
code <-c("C3059","C3716","C4917","C3512","C3493","C3771","C4017","C4001")
# add prefix for query
code <- sub(".",'NCIT:C',code)
```

load data with the specified codes

```{r}
freq_pgxmatrix <- pgxLoader(type="cnv_frequency",output ="pgxmatrix",filters=code)
freq_pgxmatrix
```

The returned data is stored in [*RangedSummarizedExperiment*](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent filters. 

To get metadata such as count of samples used to calculate frequency, use `colData` function from *SummarizedExperiment* package:

```{r}
colData(freq_pgxmatrix)
```

To access the CNV frequency matrix, use `assay` accesssor from *SummarizedExperiment* package

```{r}
head(assay(freq_pgxmatrix))
```

The matrix has 6212 rows (genomic regions) and 8 columns (filters). The rows comprised 3106 intervals with “gain status” plus 3106 intervals with “loss status”. 

The value is the percentage of samples from the corresponding filter having one or more CNV events in the 
specific genomic intervals. You could get the interval information by `rowRanges` function from *SummarizedExperiment* package

```{r}
rowRanges(freq_pgxmatrix)
```

For example, if the value in the second row and first column is 8.457, it means that 8.457% samples from the corresponding filter NCIT:C3059 having one or more duplication events in the genomic interval in chromosome 1: 400000-1400000.

Note: it is different from CNV fraction matrix introduced in Introduction_2_loadvariants. Value in this matrix is **percentage (%) of samples** having one or more CNVs overlapped with the binned interval while the value in CNV fraction matrix is **fraction in individual samples** to indicate how much the binned interval overlaps with one or more CNVs in one sample.

# Calculate CNV frequency 

## `segtoFreq` function

This function computes the binned CNV frequency from segment data.

The parameters of this function:

* `data`: Segment data with CNV states. The first four columns should specify sample ID, chromosome, start position, and end position, respectively. The column representing CNV states should contain either "DUP" for duplications or "DEL" for deletions.
* `cnv_column_idx`: Index of the column specifying CNV state. Default is 6, following the "pgxseg" format used in Progenetix. If the input segment data uses the general `.seg` file format, it might need to be set differently.
* `cohort_name`: A string specifying the cohort name. Default is "unspecified cohort".
* `assembly`: A string specifying the genome assembly version for CNV frequency calculation. Allowed options are "hg19" or "hg38". Default is "hg38".
* `bin_size`: Size of genomic bins used to split the genome, in base pairs (bp). Default is 1,000,000.
* `overlap`: Numeric value defining the amount of overlap between bins and segments considered as bin-specific CNV, in base pairs (bp). Default is 1,000.
* `soft_expansion`: Fraction of `bin_size` to determine merge criteria. During the generation of genomic bins, division starts at the centromere and expands towards the telomeres on both sides. If the size of the last bin is smaller than `soft_expansion` * bin_size, it will be merged with the previous bin. Default is 0.1.

Suppose you have segment data from several biosamples:

```{r}
# access variant data
variants <- pgxLoader(type="g_variants",biosample_id = c("pgxbs-kftvhmz9", "pgxbs-kftvhnqz","pgxbs-kftvhupd"),output="pgxseg")
# only keep segment cnv data
segdata <- variants[variants$variant_type %in% c("DUP","DEL"),]
```

You can then calculate the CNV frequency from this cohort comprised of these samples. The output is stored in "pgxfreq" format:

```{r}
segfreq <- segtoFreq(segdata,cohort_name="c1")
segfreq
```

# Visualize CNV frequency

## `pgxFreqplot` function

This function provides CNV frequency plots by genome or chromosomes as you request.

The parameters of this function:

* `data`: CNV frequency object returned by `pgxLoader` or `segtoFreq` functions. 
* `chrom`: A vector specifying which chromosomes to plot. If NULL, the plot will cover the entire genome. 
#' If specified, the frequencies are plotted with one panel for each chromosome. Default is NULL.
* `layout`: Number of columns and rows in plot. Only used in plot by chromosome. Default is c(1,1).
* `filters`: Index or string value indicating which filter to plot. The length of filters
is limited to one if the parameter `circos` is FALSE. Default is the first filter.
* `circos`: A logical value indicating whether to return a circos plot. If TRUE, it returns a circos plot 
that can display and compare multiple filters. Default is FALSE.
* `highlight`: Indices of genomic bins to be highlighted in red.
* `assembly`: A string specifying the genome assembly version to apply to CNV frequency plotting. 
Allowed options are "hg19" and "hg38". Default is "hg38".

## CNV frequency plot by genome

### Input is `pgxfreq` object

```{r, fig.width=7, fig.height=5}
pgxFreqplot(freq_pgxfreq, filters="pgx:icdom-85003")
```

### Input is `pgxmatrix` object 

```{r, fig.width=7, fig.height=5}
pgxFreqplot(freq_pgxmatrix, filters = "NCIT:C3512")
```

## CNV frequency plot by chromosomes

```{r, fig.width=7, fig.height=5}
pgxFreqplot(freq_pgxfreq, filters='NCIT:C4038',chrom=c(1,2,3), layout = c(3,1))  
```

## CNV frequency circos plot 

```{r,fig.width=6, fig.height=6}
pgxFreqplot(freq_pgxfreq, filters='pgx:icdom-85003', circos = TRUE)
```

The circos plot also supports multiple group comparison

```{r,fig.width=6, fig.height=6}
pgxFreqplot(freq_pgxfreq,filters= c("NCIT:C4038","pgx:icdom-85003"),circos = TRUE) 
```

## Highlight interesting genomic intervals

If you want to look at the CNV frequency at specific genomic bins, you can use `highlight` parameter.
For example, when you are interested in CNV frequency of CCND1 gene in samples with infiltrating duct carcinoma (icdom-85003).
You could first find the genomic bin where CCND1 (chr11:69641156-69654474) is located.

```{r}
# Extract the CNV frequency data frame of samples from 'icdom-85003' from 
# the previously returned object
freq_IDC <- freq_pgxfreq[['pgx:icdom-85003']]
# search the genomic bin where CCND1 is located
bin <- which(seqnames(freq_IDC) == 11 & start(freq_IDC) <= 69641156 &  
             end(freq_IDC) >= 69654474)
freq_IDC[bin,]
```

Then you could highlight this genomic bin like this

```{r, fig.width=7, fig.height=5}
pgxFreqplot(freq_pgxfreq,filters = 'pgx:icdom-85003', chrom = 11,highlight = bin)
```

Note: For CNV analysis of specific genes, the highlighted plot is rough as a reference, 
because the bin size in frequency plots is 1MB, which is possible to cover multiple genes.

The highlighting is also available for genome plots and circos plots. And you could highlight 
multiple bins by a vector of indices.

```{r, fig.width=7, fig.height=5}
pgxFreqplot(freq_pgxfreq,filters = 'pgx:icdom-85003',highlight = c(1:100))
```

# Session Info

```{r echo = FALSE}
sessionInfo()
```