Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette offers a comprehensive guide on accessing and visualizing CNV frequency data within the Progenetix database. CNV frequency is pre-calculated based on CNV segment data in Progenetix and reflects the CNV pattern in a cohort. It is defined as the percentage of samples showing a CNV for a genomic region (1MB-sized genomic bins in this case) over the total number of samples in a cohort specified by filters.
If your focus lies in cancer cell lines, you can access data from cancercelllines.org by
setting the domain
parameter to “https://cancercelllines.org” in pgxLoader
function. This data repository originates from CNV profiling data of
cell lines initially collected as part of Progenetix and currently
includes additional types of genomic mutations.
library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data
library(GenomicRanges) # for pgxfreq data
pgxLoader
functionThis function loads various data from Progenetix
database via the Beacon v2 API with some extensions (BeaconPlus).
The parameters of this function used in this tutorial:
type
: A string specifying output data type. Only
“cnv_frequency” is used in this tutorial.output
: A string specifying output file format. When
the parameter type
is “cnv_frequency”, available options
are “pgxfreq” or “pgxmatrix”.filters
: Identifiers used in public repositories,
bio-ontology terms, or custom terms such as c(“NCIT:C7376”,
“PMID:22824167”). When multiple filters are used, they are
combined using OR logic when the parameter type
is
“cnv_frequency”. For more information about filters, see the documentation.dataset
: A string specifying the dataset to query from
the Beacon response. Default is NULL, which includes results from all
datasets.output
= “pgxfreq”)freq_pgxfreq <- pgxLoader(type="cnv_frequency", output ="pgxfreq",
filters=c("NCIT:C3058","NCIT:C3493"))
freq_pgxfreq
#> GRangesList object of length 2:
#> $`NCIT:C3058`
#> GRanges object with 3106 ranges and 4 metadata columns:
#> seqnames ranges strand | low_gain_frequency
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 0-400000 * | 4.05
#> [2] 1 400000-1400000 * | 8.65
#> [3] 1 1400000-2400000 * | 12.90
#> [4] 1 2400000-3400000 * | 12.55
#> [5] 1 3400000-4400000 * | 16.60
#> ... ... ... ... . ...
#> [3102] Y 52400000-53400000 * | 0.95
#> [3103] Y 53400000-54400000 * | 0.95
#> [3104] Y 54400000-55400000 * | 0.95
#> [3105] Y 55400000-56400000 * | 0.95
#> [3106] Y 56400000-57227415 * | 0.95
#> high_gain_frequency low_loss_frequency high_loss_frequency
#> <numeric> <numeric> <numeric>
#> [1] 0.15 5.70 0.20
#> [2] 0.20 8.10 0.05
#> [3] 0.20 9.40 0.05
#> [4] 0.35 12.80 0.30
#> [5] 0.35 16.95 1.00
#> ... ... ... ...
#> [3102] 0 1 0
#> [3103] 0 1 0
#> [3104] 0 1 0
#> [3105] 0 1 0
#> [3106] 0 1 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
#>
#> $`NCIT:C3493`
#> GRanges object with 3106 ranges and 4 metadata columns:
#> seqnames ranges strand | low_gain_frequency
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 0-400000 * | 2.270
#> [2] 1 400000-1400000 * | 5.728
#> [3] 1 1400000-2400000 * | 5.160
#> [4] 1 2400000-3400000 * | 8.617
#> [5] 1 3400000-4400000 * | 7.276
#> ... ... ... ... . ...
#> [3102] Y 52400000-53400000 * | 0
#> [3103] Y 53400000-54400000 * | 0
#> [3104] Y 54400000-55400000 * | 0
#> [3105] Y 55400000-56400000 * | 0
#> [3106] Y 56400000-57227415 * | 0
#> high_gain_frequency low_loss_frequency high_loss_frequency
#> <numeric> <numeric> <numeric>
#> [1] 0.619 4.696 0.464
#> [2] 0.516 12.745 0.103
#> [3] 0.155 13.261 0.155
#> [4] 1.548 24.768 0.619
#> [5] 0.361 25.129 0.516
#> ... ... ... ...
#> [3102] 0 0.103 0
#> [3103] 0 0.103 0
#> [3104] 0 0.103 0
#> [3105] 0 0.103 0
#> [3106] 0 0.103 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
The returned data is stored in GRangesList container which consists of multiple GRanges objects. Each GRanges object stores CNV frequency from samples pecified by a particular filter. Within each GRanges object, there are annotation columns that provide information on CNV frequencies for each row, expressed as percentage values across samples (%). These percentages represent level-specific gains and losses that overlap the corresponding genomic intervals.
These genomic intervals are derived from the partitioning of the entire genome (GRCh38). Most of these bins have a size of 1MB, except for a few bins located near the telomeres. In total, there are 3106 intervals encompassing the genome.
To access the CNV frequency data from specific filters, you could access like this
freq_pgxfreq[["NCIT:C3058"]]
#> GRanges object with 3106 ranges and 4 metadata columns:
#> seqnames ranges strand | low_gain_frequency
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 0-400000 * | 4.05
#> [2] 1 400000-1400000 * | 8.65
#> [3] 1 1400000-2400000 * | 12.90
#> [4] 1 2400000-3400000 * | 12.55
#> [5] 1 3400000-4400000 * | 16.60
#> ... ... ... ... . ...
#> [3102] Y 52400000-53400000 * | 0.95
#> [3103] Y 53400000-54400000 * | 0.95
#> [3104] Y 54400000-55400000 * | 0.95
#> [3105] Y 55400000-56400000 * | 0.95
#> [3106] Y 56400000-57227415 * | 0.95
#> high_gain_frequency low_loss_frequency high_loss_frequency
#> <numeric> <numeric> <numeric>
#> [1] 0.15 5.70 0.20
#> [2] 0.20 8.10 0.05
#> [3] 0.20 9.40 0.05
#> [4] 0.35 12.80 0.30
#> [5] 0.35 16.95 1.00
#> ... ... ... ...
#> [3102] 0 1 0
#> [3103] 0 1 0
#> [3104] 0 1 0
#> [3105] 0 1 0
#> [3106] 0 1 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
To get metadata such as count of samples used to calculate frequency,
use mcols
function from GenomicRanges package:
output
= “pgxmatrix”)Choose 8 NCIt codes of interests that correspond to different tumor types
code <-c("C3059","C3716","C4917","C3512","C3493","C3771","C4017","C4001")
# add prefix for query
code <- sub(".",'NCIT:C',code)
load data with the specified codes
freq_pgxmatrix <- pgxLoader(type="cnv_frequency",output ="pgxmatrix",filters=code)
freq_pgxmatrix
#> class: RangedSummarizedExperiment
#> dim: 6212 8
#> metadata(0):
#> assays(2): lowlevel_cnv_frequency highlevel_cnv_frequency
#> rownames(6212): 1 2 ... 6211 6212
#> rowData names(1): type
#> colnames(8): NCIT:C3059 NCIT:C3493 ... NCIT:C4017 NCIT:C4917
#> colData names(3): filter label sample_count
The returned data is stored in RangedSummarizedExperiment object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent filters.
To get metadata such as count of samples used to calculate frequency,
use colData
function from SummarizedExperiment
package:
colData(freq_pgxmatrix)
#> DataFrame with 8 rows and 3 columns
#> filter label sample_count
#> <character> <character> <integer>
#> NCIT:C3059 NCIT:C3059 Glioma 8186
#> NCIT:C3493 NCIT:C3493 Lung Squamous Cell C.. 1938
#> NCIT:C3512 NCIT:C3512 Lung Adenocarcinoma 4664
#> NCIT:C3716 NCIT:C3716 Primitive Neuroectod.. 2214
#> NCIT:C3771 NCIT:C3771 Breast Lobular Carci.. 904
#> NCIT:C4001 NCIT:C4001 Breast Inflammatory .. 27
#> NCIT:C4017 NCIT:C4017 Breast Ductal Carcin.. 10097
#> NCIT:C4917 NCIT:C4917 Lung Small Cell Carc.. 558
To access the CNV frequency matrix, use assay
accesssor
from SummarizedExperiment package
head(assay(freq_pgxmatrix,"lowlevel_cnv_frequency"))
#> NCIT:C3059 NCIT:C3493 NCIT:C3512 NCIT:C3716 NCIT:C3771 NCIT:C4001 NCIT:C4017
#> 1 4.60 2.270 2.55 3.35 0.774 3.704 6.00
#> 2 8.65 5.728 9.30 7.15 1.217 7.407 6.65
#> 3 12.95 5.160 10.35 8.80 1.881 7.407 4.20
#> 4 10.80 8.617 13.75 8.45 3.097 3.704 4.40
#> 5 14.55 7.276 13.50 8.85 2.655 3.704 4.50
#> 6 8.10 7.069 12.80 7.90 1.659 3.704 3.95
#> NCIT:C4917
#> 1 7.527
#> 2 21.147
#> 3 20.430
#> 4 22.939
#> 5 21.864
#> 6 21.685
The matrix has 6212 rows (genomic regions) and 8 columns (filters). The rows comprised 3106 intervals with “gain status” plus 3106 intervals with “loss status”.
The value is the percentage of samples from the corresponding filter having one or more CNV events in the specific genomic intervals. For example, if the value in the second row and first column is 8.65, it means that 8.65% samples from the corresponding filter NCIT:C3059 having one or more duplication events in the genomic interval in chromosome 1: 400000-1400000.
You could get the interval information by rowRanges
function from SummarizedExperiment package.
rowRanges(freq_pgxmatrix)
#> GRanges object with 6212 ranges and 1 metadata column:
#> seqnames ranges strand | type
#> <Rle> <IRanges> <Rle> | <character>
#> 1 1 0-400000 * | gain
#> 2 1 400000-1400000 * | gain
#> 3 1 1400000-2400000 * | gain
#> 4 1 2400000-3400000 * | gain
#> 5 1 3400000-4400000 * | gain
#> ... ... ... ... . ...
#> 6208 Y 52400000-53400000 * | loss
#> 6209 Y 53400000-54400000 * | loss
#> 6210 Y 54400000-55400000 * | loss
#> 6211 Y 55400000-56400000 * | loss
#> 6212 Y 56400000-57227415 * | loss
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
Note: it is different from CNV fraction matrix introduced in Introduction_2_loadvariants. Value in this matrix is percentage (%) of samples having one or more CNVs overlapped with the binned interval while the value in CNV fraction matrix is fraction in individual samples to indicate how much the binned interval overlaps with one or more CNVs in one sample.
segtoFreq
functionThis function computes the binned CNV frequency from segment data.
The parameters of this function:
data
: Segment data containing CNV states. The first
four columns should represent sample ID, chromosome, start position, and
end position, respectively. The fifth column can contain the number of
markers or other relevant information. The column representing CNV
states (with a column index of 6 or higher) should either contain “DUP”
for duplications and “DEL” for deletions, or level-specific CNV states
such as “EFO:0030072”, “EFO:0030071”, “EFO:0020073”, and “EFO:0030068”,
which correspond to high-level duplication, low-level duplication,
high-level deletion, and low-level deletion, respectively.cnv_column_idx
: Index of the column specifying CNV
state. Default is 6, following the “pgxseg” format used in Progenetix.
If the input segment data uses the general .seg
file
format, it might need to be set differently.cohort_name
: A string specifying the cohort name.
Default is “unspecified cohort”.assembly
: A string specifying the genome assembly
version for CNV frequency calculation. Allowed options are “hg19” or
“hg38”. Default is “hg38”.bin_size
: Size of genomic bins used to split the
genome, in base pairs (bp). Default is 1,000,000.overlap
: Numeric value defining the amount of overlap
between bins and segments considered as bin-specific CNV, in base pairs
(bp). Default is 1,000.soft_expansion
: Fraction of bin_size
to
determine merge criteria. During the generation of genomic bins,
division starts at the centromere and expands towards the telomeres on
both sides. If the size of the last bin is smaller than
soft_expansion
* bin_size, it will be merged with the
previous bin. Default is 0.1.Suppose you have segment data from several biosamples:
# access variant data
variants <- pgxLoader(type="g_variants",biosample_id = c("pgxbs-kftvhmz9", "pgxbs-kftvhnqz","pgxbs-kftvhupd"),output="pgxseg")
# only keep segment cnv data
segdata <- variants[variants$variant_type %in% c("DUP","DEL"),]
head(segdata)
#> biosample_id reference_name start end log2 variant_type
#> 1 pgxbs-kftvhmz9 1 3477686 9548738 -0.8340 DEL
#> 2 pgxbs-kftvhmz9 1 12188446 12774437 -0.8604 DEL
#> 3 pgxbs-kftvhmz9 1 24280069 25153245 0.4973 DUP
#> 4 pgxbs-kftvhmz9 1 26555954 28546015 -0.7740 DEL
#> 5 pgxbs-kftvhmz9 1 32300691 32332533 -0.7769 DEL
#> 6 pgxbs-kftvhmz9 1 32397196 32595742 -0.6995 DEL
#> reference_bases alternate_bases variant_state_id variant_state_label
#> 1 . . EFO:0030068 low-level copy number loss
#> 2 . . EFO:0030068 low-level copy number loss
#> 3 . . EFO:0030071 low-level copy number gain
#> 4 . . EFO:0030068 low-level copy number loss
#> 5 . . EFO:0030068 low-level copy number loss
#> 6 . . EFO:0030068 low-level copy number loss
You can then calculate the CNV frequency for this cohort based on the specified CNV calls. The output will be stored in the “pgxfreq” format.
In the following example, the CNV frequency is calculated based on “DUP” and “DEL” calls:
segfreq1 <- segtoFreq(segdata,cnv_column_idx = 6, cohort_name="c1")
segfreq1
#> GRangesList object of length 1:
#> $c1
#> GRanges object with 3106 ranges and 2 metadata columns:
#> seqnames ranges strand | gain_frequency loss_frequency
#> <Rle> <IRanges> <Rle> | <numeric> <numeric>
#> [1] 1 0-400000 * | 0 0.0000
#> [2] 1 400000-1400000 * | 0 0.0000
#> [3] 1 1400000-2400000 * | 0 0.0000
#> [4] 1 2400000-3400000 * | 0 0.0000
#> [5] 1 3400000-4400000 * | 0 33.3333
#> ... ... ... ... . ... ...
#> [3102] Y 52400000-53400000 * | 0 0
#> [3103] Y 53400000-54400000 * | 0 0
#> [3104] Y 54400000-55400000 * | 0 0
#> [3105] Y 55400000-56400000 * | 0 0
#> [3106] Y 56400000-57227415 * | 0 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
For level-specific CNV calls, set the cnv_column_idx
parameter to the column containing CNV states represented as EFO codes.
segfreq2 <- segtoFreq(segdata,cnv_column_idx = 9,cohort_name="c1")
segfreq2
#> GRangesList object of length 1:
#> $c1
#> GRanges object with 3106 ranges and 4 metadata columns:
#> seqnames ranges strand | low_gain_frequency
#> <Rle> <IRanges> <Rle> | <numeric>
#> [1] 1 0-400000 * | 0
#> [2] 1 400000-1400000 * | 0
#> [3] 1 1400000-2400000 * | 0
#> [4] 1 2400000-3400000 * | 0
#> [5] 1 3400000-4400000 * | 0
#> ... ... ... ... . ...
#> [3102] Y 52400000-53400000 * | 0
#> [3103] Y 53400000-54400000 * | 0
#> [3104] Y 54400000-55400000 * | 0
#> [3105] Y 55400000-56400000 * | 0
#> [3106] Y 56400000-57227415 * | 0
#> low_loss_frequency high_gain_frequency high_loss_frequency
#> <numeric> <numeric> <numeric>
#> [1] 0.0000 0 0
#> [2] 0.0000 0 0
#> [3] 0.0000 0 0
#> [4] 0.0000 0 0
#> [5] 33.3333 0 0
#> ... ... ... ...
#> [3102] 0 0 0
#> [3103] 0 0 0
#> [3104] 0 0 0
#> [3105] 0 0 0
#> [3106] 0 0 0
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
pgxFreqplot
functionThis function provides CNV frequency plots by genome or chromosomes as you request.
The parameters of this function:
data
: CNV frequency object returned by
pgxLoader
or segtoFreq
functions.chrom
: A vector specifying which chromosomes to plot.
If NULL, the plot will cover the entire genome. #’ If specified, the
frequencies are plotted with one panel for each chromosome. Default is
NULL.layout
: Number of columns and rows in plot. Only used
in plot by chromosome. Default is c(1,1).filters
: Index or string value indicating which filter
to plot. The length of filters is limited to one if the parameter
circos
is FALSE. Default is the first filter.circos
: A logical value indicating whether to return a
circos plot. If TRUE, it returns a circos plot that can display and
compare multiple filters. Default is FALSE.assembly
: A string specifying the genome assembly
version to apply to CNV frequency plotting. Allowed options are “hg19”
and “hg38”. Default is “hg38”.#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.35.5 Biobase_2.67.0
#> [3] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
#> [5] IRanges_2.39.2 S4Vectors_0.43.2
#> [7] BiocGenerics_0.53.0 MatrixGenerics_1.17.1
#> [9] matrixStats_1.4.1 pgxRpi_1.3.0
#> [11] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.2
#> [4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
#> [7] lifecycle_1.0.4 survival_3.7-0 magrittr_2.0.3
#> [10] compiler_4.4.1 rlang_1.1.4 sass_0.4.9
#> [13] tools_4.4.1 utf8_1.2.4 yaml_2.3.10
#> [16] data.table_1.16.2 knitr_1.48 ggsignif_0.6.4
#> [19] S4Arrays_1.5.11 labeling_0.4.3 curl_5.2.3
#> [22] DelayedArray_0.33.1 abind_1.4-8 withr_3.0.2
#> [25] purrr_1.0.2 sys_3.4.3 grid_4.4.1
#> [28] fansi_1.0.6 ggpubr_0.6.0 future_1.34.0
#> [31] xtable_1.8-4 colorspace_2.1-1 ggplot2_3.5.1
#> [34] globals_0.16.3 scales_1.3.0 cli_3.6.3
#> [37] rmarkdown_2.28 crayon_1.5.3 generics_0.1.3
#> [40] future.apply_1.11.3 km.ci_0.5-6 httr_1.4.7
#> [43] survminer_0.4.9 cachem_1.1.0 zlibbioc_1.51.2
#> [46] splines_4.4.1 parallel_4.4.1 BiocManager_1.30.25
#> [49] XVector_0.45.0 survMisc_0.5.6 vctrs_0.6.5
#> [52] Matrix_1.7-1 jsonlite_1.8.9 carData_3.0-5
#> [55] car_3.1-3 rstatix_0.7.2 Formula_1.2-5
#> [58] listenv_0.9.1 maketools_1.3.1 attempt_0.3.1
#> [61] tidyr_1.3.1 jquerylib_0.1.4 parallelly_1.38.0
#> [64] glue_1.8.0 codetools_0.2-20 shape_1.4.6.1
#> [67] lubridate_1.9.3 gtable_0.3.6 UCSC.utils_1.1.0
#> [70] munsell_0.5.1 tibble_3.2.1 pillar_1.9.0
#> [73] htmltools_0.5.8.1 circlize_0.4.16 GenomeInfoDbData_1.2.13
#> [76] R6_2.5.1 KMsurv_0.1-5 evaluate_1.0.1
#> [79] lattice_0.22-6 highr_0.11 backports_1.5.0
#> [82] broom_1.0.7 bslib_0.8.0 SparseArray_1.5.45
#> [85] gridExtra_2.3 xfun_0.48 GlobalOptions_0.1.2
#> [88] zoo_1.8-12 buildtools_1.0.0 pkgconfig_2.0.3