load example data
To demonstrate the use of ZygosityPredictor, NGS data from the Seq-C2
project were used [1]. In the following chunk, all required datalayer of
the WGS_LL_T_1 sample are loaded. The variants are loaded as GRanges
objects, one for somatic copy number alterations (GR_SCNA), one for
germline- and one for somatic small variants (GR_GERM_SMALL_VARS and
GR_SOM_SMALL_VARS). The input formats will be discussed in more detail
in section 4.
library(ZygosityPredictor)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following object is masked from 'package:dplyr':
##
## explain
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:dplyr':
##
## combine
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: GenomeInfoDb
# file to sequence alignment
FILE_BAM <- system.file("extdata", "ZP_example.bam",
package = "ZygosityPredictor")
VCF <- system.file("extdata", "ZP_example_chr7.vcf.gz",
package = "ZygosityPredictor")
# meta information of the sample
PURITY <- 0.98
PLOIDY <- 1.57
SEX <- "female"
# variants
data("GR_SCNA")
data("GR_GERM_SMALL_VARS")
data("GR_SOM_SMALL_VARS")
# used gene model
data("GR_GENE_MODEL")
data("GR_HAPLOBLOCKS")
Calculation of affected copies of a variant
Two functions are provided to calculate how many copies are affected
by single small variants, based on two formulas, one for germline
variants and one for somatic variants.
Germline variants
To calculate the affected copies for a germline variant by using
aff_germ_copies()
, the following inputs are required:
- af: numeric; between 0 and 1; calculated allele
frequency of the variant in the tumor sample
- tcn: numeric; total copy number at the position of
the variant
- purity: numeric; between 0 and 1; purity or tumor
cell content of the tumor sample
- c_normal: numeric; expected copy number at the
position of the variant in normal tissue, 1 for gonosomes in male
samples, and 2 for male autosomes and all chromosomes in female samples.
(The function can also assess the c_normal parameter by itself, but then
the following two inputs must be provided: chr and sex)
- chr: (only if c_normal is not provided) character;
can be either a single number or in the “chr1” format; chromosome of the
variant
- sex: (only if c_normal is not provided) character;
either “male” or “female” / “m” or “f”; sex of the sample
- af_normal: (default 0.5) numeric; allele-frequency
of germline variant in normal tissue. 0.5 represents heterozygous
variants in diploid genome, 1 would be homozygous. Could be relevant if
germline CNVs are present at the position. Then also the c_normal
parameter would have to be adjusted.
the output is a numeric value that indicates the affected copies.
## as an example we take the first variant of our prepared input data and
## extract the required information from different input data layer
## the allele frequency and the chromosome can be taken from the GRanges object
AF = elementMetadata(GR_GERM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_GERM_SMALL_VARS[1]) %>%
as.character()
## the total copy number (tcn) can be extracted from the CNV object by selecting
## the CNV from the position of the variant
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_GERM_SMALL_VARS[1])
)[["tcn"]]
## purity and sex can be taken from the global variables of the sample
## with this function call the affected copies are calculated for the variant
aff_germ_copies(af=AF,
tcn=TCN,
purity=PURITY,
chr=CHR,
sex=SEX)
## [1] 1.50733
Somatic variants
To calculate how many copies are affected by a somatic variant by
aff_som_copies()
, the same inputs are required, but a
different formula is evaluated:
## the function for somatic variants works the same way as the germline function
AF = elementMetadata(GR_SOM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_SOM_SMALL_VARS[1]) %>%
as.character()
TCN = elementMetadata(
subsetByOverlaps(GR_SCNA, GR_SOM_SMALL_VARS[1])
)[["tcn"]]
aff_som_copies(af=AF,
chr=CHR,
tcn=TCN,
purity=PURITY,
sex=SEX)
## [1] 1.471406
Calculate affected copies of a set of variants
In order to apply the previously mentioned functions to a whole set
of variants and calculate the affected copies, the following code can be
used.
## as an example we calculate the affected copies for the somatic variants:
GR_SOM_SMALL_VARS %>%
## cnv information for every variant is required.. merge with cnv object
IRanges::mergeByOverlaps(GR_SCNA) %>%
as_tibble() %>%
## select relevant columns
select(chr=1, pos=2, gene, af, tcn) %>%
mutate_at(.vars=c("tcn", "af"), .funs=as.numeric) %>%
rowwise() %>%
mutate(
aff_copies = aff_som_copies(chr, af, tcn, PURITY, SEX),
wt_copies = tcn-aff_copies
)
## # A tibble: 10 × 7
## # Rowwise:
## chr pos gene af tcn aff_copies wt_copies
## <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 chr6 29100982 OR2J1 0.358 4.06 1.47 2.59
## 2 chr6 29101487 OR2J1 0.328 4.06 1.35 2.72
## 3 chr6 29101522 OR2J1 0.328 4.06 1.35 2.72
## 4 chr7 107767595 SLC26A3 0.0263 2.49 0.0665 2.42
## 5 chr7 107774037 SLC26A3 0.0476 2.49 0.120 2.37
## 6 chr16 48216115 ABCC11 0.352 3.07 1.10 1.97
## 7 chr16 48231866 ABCC11 0.347 3.07 1.08 1.99
## 8 chrX 88753422 CPXCR1 0.5 1.12 0.579 0.539
## 9 chrX 88753806 CPXCR1 0.491 1.12 0.569 0.549
## 10 chr17 41771694 JUP 0.857 1.06 0.947 0.117
There s also the function predict_per_variant
that
basically does the same with slightly adjusted inputs.
predict_per_variant(purity=PURITY,
sex=SEX,
somCna=GR_SCNA,
somSmallVars=GR_SOM_SMALL_VARS)
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in check_gr_small_vars(germSmallVars, "germline", ZP_env): Input
## germSmallVars empty/does not contain variants. Assuming there are no germline
## small variants
## Warning in check_opt_incdel(includeIncompleteDel, ploidy): Large scale
## deletions cannot be included without ploidyPlease provide input ploidy. Provide
## ploidy=2to assume diploid case
## Warning in check_opt_incdel(includeHomoDel, ploidy): Large scale deletions
## cannot be included without ploidyPlease provide input ploidy. Provide
## ploidy=2to assume diploid case
## $evaluation_per_variant
## # A tibble: 10 × 19
## gene mut_id pos ref alt af tcn cna_type all_imb gt_cna seg_id
## <chr> <chr> <int> <chr> <chr> <dbl> <dbl> <chr> <lgl> <chr> <int>
## 1 OR2J1 m1 2.91e7 C A 0.358 4.06 HZ FALSE <NA> 5
## 2 OR2J1 m2 2.91e7 T C 0.328 4.06 HZ FALSE <NA> 5
## 3 OR2J1 m3 2.91e7 C T 0.328 4.06 HZ FALSE <NA> 5
## 4 SLC26A3 m1 1.08e8 C T 0.0263 2.49 LOH FALSE 2:sub 7
## 5 SLC26A3 m2 1.08e8 G C 0.0476 2.49 LOH FALSE 2:sub 7
## 6 ABCC11 m1 4.82e7 G T 0.352 3.07 HZ TRUE 1:2 1
## 7 ABCC11 m2 4.82e7 C T 0.347 3.07 HZ TRUE 1:2 1
## 8 CPXCR1 m1 8.88e7 A C 0.5 1.12 HZ FALSE <NA> 8
## 9 CPXCR1 m2 8.88e7 G A 0.491 1.12 HZ FALSE <NA> 8
## 10 JUP m1 4.18e7 GTGT… G 0.857 1.06 LOH FALSE 1:0 2
## # ℹ 8 more variables: tcn_assumed <lgl>, origin <chr>, chr <fct>, class <chr>,
## # aff_cp <dbl>, wt_cp <dbl>, vn_status <dbl>, pre_info <chr>
##
## $combined_uncovered
## NULL
Predict Zygosity
In this section, we will use the WGS_LL_T_1 dataset from the Seq-C2
project as an example to investigate whether mutations in the following
genes result in total absence of wildtype copies. The genes which were
selected as an example for the analysis are shown below. The example
data set was reduced to these genes.
- TP53
- BRCA1
- TRANK1
- TRIM3
- JUP
- CDYL
- SCRIB
- ELP2
Predict zygosity for a set of genes in a sample
The prediction of zygosity for a set of genes in a sample can be
assessed by the predict_zygosity()
function.
Important note: The runtime of the analysis depends
strongly on the number of genes to be assessed and on the number of
input variants. It is therefore recommended to reduce the number of
genes to the necessary ones. Also, depending on the research question to
be addressed, the variants should be filtered to the most relevant ones,
not only because of runtime considerations, but also to sharpen the
final result. A large number of mutations in a gene, some of which are
of little biological relevance or even SNPs, will inevitably reduce the
validity of the results.
fp <- predict_zygosity(
purity = PURITY,
ploidy = PLOIDY,
sex = SEX,
somCna = GR_SCNA,
somSmallVars = GR_SOM_SMALL_VARS,
germSmallVars = GR_GERM_SMALL_VARS,
geneModel = GR_GENE_MODEL,
bamDna = FILE_BAM,
vcf=VCF,
haploBlocks = GR_HAPLOBLOCKS
)
## Warning in check_gr_gene_model(geneModel, ZP_env):
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## Warning in FUN(X[[i]], ...): column gt_cna contains annotations of balanced
## segments. They are removed from imbalance phasing if enabled
## no RNA file provided: Analysis will be done without RNA reads
Interpretation of results
Of note, the results displayed here were chosen to explain and
exemplify the functionality of the tool; biological and medical impact
of the specific variants has not been a selection criterion. The result
which is returned by the function consists of a list of tibbles:
- Evaluation per variant
- Evaluation per gene
- Main Phasing info
- detailed read-level phasing (RLP) info
- detailed alleleic imbalance phasing (AIP) info, if enabled
- Read pair info (only if showReadDetail=TRUE and logDir is
provided)
- Variants not covered by somCna (only if present and no sCNA gap
assumption was done)
One way of accessing the results is a simple extraction of the
tibbles from the list. In addition, two accessor function are
implemented: ZP_ov()
shows an overview of the full
resultand and gene_ov()
shows more detailed information
about a selected gene.
## Zygosity Prediction of 12 input variants
## 0 are uncovered by sCNA input and were not evaluated
## 7 genes analyzed:
## |status |eval_by | n|
## |:-------------------|:--------------|--:|
## |all_copies_affected |aff_cp | 2|
## |wt_copies_left |aff_cp | 2|
## |wt_copies_left |direct-phasing | 1|
## |wt_copies_left |insufficient | 2|
## |undefined |NA | 0|
## phased genes:
## CPXCR1, OR2J1
The function provides an overview about the evaluations done by
ZygosityPredictor. One can see that in our case 12 input small variants
were used to predict gene status. Of those, two got the status
all_copies_affected.
Evaluation per variant
The first result of the function is the evaluation per variant. In
this step all information required for subsequent steps is annotated and
the affected copies per variant are calculated. For every variant, the
function checks whether it already affects all copies of the gene. The
format of the output is a tibble; the number of rows corresponds to the
total number of input variants. The tool annotates a few
self-explanatory columns such as the origin of the respective variant
(germline/somatic) or the class (snv/ins/del). It also appends
information from the sCNA results: the total copy number at the position
of the variant and the information if a loss of heterozygosity is
present (cna_type). Also, an ID is assigned to every small variant.
Then, the genes are numbered consecutively in order to unambiguously
assign variants to genes in the following analysis. The most important
results of this step are the calculation of the affected and wildtype
copies, as well as, depending on the data, an initial check of whether a
variant already affects all copies.
Of note, there can be situations in which left wildtype copies are
below 0.5,
but still this information is not sufficient to predict
“all_copies_affected” without doubt. Depending on the origin of
the variant, further criteria must be met (e.g., LOH). The procedure for
this first check is shown in the pre_info column.
Evaluation per gene + phasing info
By using gene_ov()
, more detailed information can be
viewed about how the tool came to a gene status. What the accessor
function does is basically filtering the output tibbles to the gene of
relevance. Detailed explanation about all columns can be found below the
next chunk.
## Top level: Gene status
## |gene | n_mut|status | conf|eval_by | wt_cp|warning |wt_cp_range |info |phasing | eval_time_s|
## |:-----|-----:|:--------------|----:|:-------|--------:|:-------|:-----------|:----------------------------------------|:--------------|-----------:|
## |OR2J1 | 3|wt_copies_left | 1|aff_cp | 2.592225|none |1.24 - 2.72 |most relevant phasing combination: m1-m2 |direct-phasing | 2.189|
##
##
## Sub level: Evaluation per variant
## |gene |origin |class |chr | pos|ref |alt | af| tcn|tcn_assumed |cna_type |all_imb |gt_cna | seg_id| aff_cp| wt_cp|mut_id |
## |:-----|:-------|:-----|:----|--------:|:---|:---|----:|----:|:-----------|:--------|:-------|:------|------:|------:|-----:|:------|
## |OR2J1 |somatic |snv |chr6 | 29100982|C |A | 0.36| 4.06|FALSE |HZ |FALSE |NA | 5| 1.47| 2.59|m1 |
## |OR2J1 |somatic |snv |chr6 | 29101487|T |C | 0.33| 4.06|FALSE |HZ |FALSE |NA | 5| 1.35| 2.72|m2 |
## |OR2J1 |somatic |snv |chr6 | 29101522|C |T | 0.33| 4.06|FALSE |HZ |FALSE |NA | 5| 1.35| 2.72|m3 |
##
##
## Sub level: Main phasing combinations
## |comb | nconst|const |phasing |via | conf|unplausible |subclonal | wt_cp| min_poss_wt_cp| max_poss_wt_cp| score|gene |
## |:-----|------:|:-----|:--------------|:---|----:|:-----------|:---------|-----:|--------------:|--------------:|-----:|:-----|
## |m2-m3 | 1|same |direct-phasing |8 | 1.00|0 |0 | 2.72| 1.37| 2.72| 1|OR2J1 |
## |m1-m2 | 1|same |direct-phasing |4 | 0.86|0 |0 | 2.59| 1.25| 2.59| 1|OR2J1 |
## |m1-m3 | 1|same |direct-phasing |7 | 0.86|0 |0 | 2.59| 1.24| 2.59| 1|OR2J1 |
##
##
## Sub level: All read-level-phasing combinations, including SNPs
## Showing 3 of 3 phasing attempts
##
## | both| mut1| mut2| dev_var| skipped|const | nconst| p_same| p_diff| none_raw| DNA_rds| RNA_rds|subclonal |unplausible | dist|class_comb |comb | ncomb|gene | conf|
## |-----:|----:|----:|-------:|-------:|:-----|------:|------:|------:|--------:|-------:|-------:|:---------|:-----------|----:|:----------|:-----|-----:|:-----|----:|
## | 2.00| 0| 0| 0| 0|same | 1| 1| 0.14| 0| 2| 0|FALSE |FALSE | 505|snv-snv |m1-m2 | 4|OR2J1 | NA|
## | 2.00| 0| 0| 0| 0|same | 1| 1| 0.14| 0| 2| 0|FALSE |FALSE | 540|snv-snv |m1-m3 | 7|OR2J1 | NA|
## | 39.99| 0| 0| 0| 0|same | 1| 1| 0.00| 22| 62| 0|FALSE |FALSE | 35|snv-snv |m2-m3 | 8|OR2J1 | NA|
##
The first prompted tibble originates from the eval_per_gene tibble of
the output. The tibble shown below originates from eval_per_variant and
shows all input small variants of the gene. The next tibble originates
from main_phasing_info and shows the main phasing combinations. Finally,
the last tibble comes from detailed_RLP_info and contains every phasing
combination including the ones with and between SNPs.
What can be seen is that the final gene status of OR2J1 is
wt_copies_left. The tool predicted around 2.6 remaining wt copies with
maximum confidence level of 1. Below it can be seen that 3 small
variants inside the gene were in the input leading to 3 main phasing
combinations. All of them were phased via direct read level phasing. As
all main combinations could be solved, the left wt copies could be
accurately defined and the gene status assigned. In this particular
case, the gene status could have been solved even without phasing all
main combinations by the case we refer to as insufficient. As the final
gene status is defnied via the so called
integrated_affected_copies
, which are defnied for diff
constellation via: aff_cp(m1)+aff_cp(m2)
and for the same
constellation: max(aff_cp(m1), aff_cp(m2))
, we can
pre-calculate the maximum affected copies in case of a diff
constellation. If the difference of the total copy number and the
integrated affected copies leaves more than 0.5 wt copies, the status
wt_cp_left can be concluded. As can be seen in the main phasing
combinations in column: min_poss_wt_cp, none of the main phasing
combinations could lead to less than 0.5 wt copies even if the variants
were found on different reads.
The outputs contain columns providing more detailed information about
gene status definition which will be explained in the following.
All read-level phasing combinations
(either visible via gene_ov()
or accessable via
fp$detailed_RLP_info
):
- both: number of reads classified as both (both
variants present), adjusted with basecall and maping quality
- mut1: number of reads classified as mut1 (only the
first variant of combination pßresent), adjusted with basecall and
maping quality
- mut2: number of reads classified as mut2 (only the
second variant of combination pßresent), adjusted with basecall and
maping quality
- dev_var: number of reads having another variant at
one of the positions of the expected variants
- skipped: number of reads not mapping to the
position of the variant (only expected with RNA reads)
- nstatus: variant constellation in numeric
representation (2 = all copies affected, 1 = wt copies left, 0 =
undefined)
- status: variant constellation in character
representation
- nstatus: numeric representation of
constellation
- xsq_same/xsq_diff: X-squared values of chi-squared
tests. _same is the one against expected same result
- p_same/p_diff: p-value of chi-squared tests
- v_same/v_diff: Cramers V value of chi-squared tests
(5)
- none: number of reads classified as none (none of
both variants found)
- DNA_rds/RNA_rds: number of DNA/RNA reads used
- dist: Distance between variants
- class_comb: mutational classes corresponding to
character combination identifier
- comb: combination identifier in character
representation (mX-mY).
- ncomb: combination identifier in numeric
representation. Derived from position in phasing matrix
- subclonal: TRUE if reads with classification both
and either mut1 or mut2 were found. There are two explanantions why
soimething like this could happen in a cancer sample: First that the
tumor is not fully clonal and a subclone is present carrying only the
variant that happened earlier during tumor development. The second
variant is only carried by a subclone that emerged later. The second
explanation could be that the reads are actually from the same cells but
a duplication of one allel happend between their occurence. We could
imagine a scenario were the genotype of a gene is 1:2 and both variants
are on the duplicated allele. If one of them happend before the
duplication, we would expect reads carrying only the earlier variant but
also reads carrying both of them.
- unplausible TRUE if reads with classification both,
mut1 and mut2 are found for the same variant combination. Compared to
the “subclonal” case this is more difficult to explain and might also be
an artifact. The only way this could happen in a cancer sample is that
the same mutational event happens independently on both alleles and or
in different subclones. This case is generally very unlikely and is
therefore annotated as unplasuible. Such a setting is also reflected in
the p-valkue and thereofre the confidence, as similary with both
expected results is found.
Main phasing combinations (either visible via gene_ov()
or accessable via fp$phasing_info
):
- phasing: Info which phasing approach was used
(direct, indirect, haploblock, imbalance, flagged)
- via: combinations which were used to solve this
combination. For direct phasing the number is the numeric representation
of the combi9nation identifier. For indirect phasing it will look like
this: 4-7-8 which measn thta the combinations 4, 7 and 8 were used to
solve.
- conf: aggregated confidence for the
combination
- wt_cp: exact left wt copies. If phasing of
combination was not succesful, wt copies can not be calculated acurately
which leads to NA
- min_poss_wt_cp: Minimal possible wt copies if
variants are on different copies according to affected copynumber of the
two variants in the combination. If greater than 0.5, the constellation
is unable to contributre to a status of all copies affrected for the
gene.
- max_poss_wt_cp: maximum possible wt copies if the
two variants are on the same copy
- score: contribution to gene status. (2 is a
contribution of all copies affected by the combination, 1 has more than
0.5 wt copies left)
- gene: Gene that w2as tried to solve with the
combination