CaMutQC: Cancer Mutation Quality Control

Introduction

Quality control of cancer somatic mutations is of great significance in cancer genomics. It helps eliminate false-positive mutations that arise during the sequencing process, thereby improving the efficiency and accuracy of downstream analysis. Here, we developed an R package, CaMutQC, for the quality control and selection of cancer somatic mutations. It offers both common and customized strategies for filtering cancer somatic mutations based on the MAF data frame, and can also select key somatic mutations related to tumorigenesis. In addition, we believe that the union of CaMutQC-filtered mutations returned by multiple variant callers contains more true-positive somatic mutations than the output of a single variant caller or the intersection of multiple callers. The package, source code and documentation are freely available on GitHub (https://github.com/likelet/CaMutQC).

Citation

In R console, enter citation("CaMutQC").

 

Installation

Via GitHub

Install the latest version of CaMutQC by typing the commands below in R console:

if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("likelet/CaMutQC")

 

An Overview

For now, there are three main functional modules in CaMutQC. The first filters cancer somatic mutations through common strategies, and the second offers users customized filtration criteria based on cancer types and published papers. The third measures TMB (Tumor Mutational Burden) through various assays. The input required by most functions in CaMutQC can be obtained by applying the vcfToMAF function to VCF files.

A MAF data frame with special labels from CaMutQC is returned after each filtration, and a filter report is generated, offering detailed and organized information.

 

Input File

Single VCF

VCF is a widely used text file format in bioinformatics for storing sequence variations. All VCF files should be annotated by VEP before being analyzed with CaMutQC, because annotated VCF files contain more detailed, clinically significant information. Information about VEP and how to run it on a VCF file can be found here.

Multiple VCF

CaMutQC supports VEP-annotated multi-sample or multi-caller VCF files as inputs, which should be located under the same file path. Supported callers: MuTect2, VarScan2, MuSE.

 

From VCF to MAF

VCF and MAF are both important formats in oncology and bioinformatics, but additional tools are needed to convert between them. The vcfToMAF function in CaMutQC performs this conversion with a one-line command in a few seconds when the input VCF file is VEP-annotated. In addition, the filterGene parameter removes variants without a Hugo Symbol when it is set to TRUE.

library(CaMutQC)
MAFdat <- vcfToMAF(system.file("extdata", "WES_EA_T_1_mutect2.vep.vcf", package="CaMutQC"))
MAFdat[1:5, 1:13]
##   Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position
## 1        AGRN         375790      .     GRCh38       chr1        1049980
## 2     GATAD2B          57459      .     GRCh38       chr1      153806614
## 3      CNOT11          55571      .     GRCh38       chr2      101270077
## 4        ANO7          50636      .     GRCh38       chr2      241190240
## 5       PDCD1           5133      .     GRCh38       chr2      241850724
##   End_Position Strand Variant_Classification Variant_Type Reference_Allele
## 1      1049980      +      Missense_Mutation          SNP                G
## 2    153806615      +                  3'UTR          DEL               TT
## 3    101270077      +                  3'UTR          DEL                T
## 4    241190240      +                 Intron          SNP                C
## 5    241850726      +                  3'UTR          DEL              GCA
##   Tumor_Seq_Allele1 Tumor_Seq_Allele2
## 1                 G                 C
## 2                TT                 -
## 3                 T                 -
## 4                 C                 T
## 5               GCA                 -

Load multi-caller data that consists of several VCF files by setting multiVCF as TRUE.

vcfPath <- system.file("extdata/Multi-caller", package="CaMutQC")
multiVCFs <- vcfToMAF(vcfPath, multiVCF=TRUE)
unique(multiVCFs$Tumor_Sample_Barcode)
## [1] "WES_EA_T_1" "TUMOR"

There are two Tumor_Sample_Barcode(s) after reading two VCF files under the Multi-caller folder.

 

Common filtering strategies

After reading a number of classical papers, we collected, sorted and summarized widely used parameters and their thresholds for cancer somatic mutation filtration. These strategies are implemented through a number of sub-functions that cover widely used criteria such as sequencing quality, strand bias and database selection. In addition, the sub-functions are integrated into bigger functions. Each function takes a MAF data frame generated by the vcfToMAF function in CaMutQC as input, and returns a labeled MAF data frame as its result.

 

Single filtration function

Sub-functions and their corresponding flags

Main function   Sub-function        Flag
mutFilterTech   mutFilterQual       Q
                mutFilterSB         S
                mutFilterAdj        A
                mutFilterNormalDP   N
                mutFilterPON        P
                FILTER              F
mutSelection    mutFilterDB         D
                mutFilterType       T
                mutFilterReg        R

Note: A variant labeled with a certain flag fails to pass the corresponding filter function, and every variant starts with the tag '0'.
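As a rough illustration of how these tags can be read back, the base R sketch below decodes a few toy CaTag strings into the filters they refer to. The lookup vector flag_meaning is a hypothetical helper built by hand from the table above; it is not an object exported by CaMutQC.

```r
# Toy CaTag values in the same style as the table() outputs in this document
ca_tag <- c("0", "0Q", "0QA", "0S", "0QDT")

# Hypothetical lookup built by hand from the flag table above
flag_meaning <- c(Q = "mutFilterQual", S = "mutFilterSB", A = "mutFilterAdj",
                  N = "mutFilterNormalDP", P = "mutFilterPON", F = "FILTER",
                  D = "mutFilterDB", T = "mutFilterType", R = "mutFilterReg")

# Extract the capital-letter flags that follow the leading '0'
flags <- regmatches(ca_tag, gregexpr("[A-Z]", ca_tag))

# For each variant, list the filters it failed (empty means no flag at all)
failed <- lapply(flags, function(f) unname(flag_meaning[f]))
names(failed) <- ca_tag

# A variant passes every filter when no flag was appended to '0'
passed <- lengths(flags) == 0
```

With a real result, ca_tag would simply be the CaTag column of the returned MAF data frame.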

 

Sequencing quality filtration

Sequencing quality parameters such as allele depth (AD), total depth (DP) and variant allele frequency (VAF) are widely used to filter potential artifacts. For convenience and flexibility, the panel parameter of this function applies a preset group of sequencing quality thresholds: users can choose between panels such as WES and MSKCC, and can still override individual thresholds under any panel.

Parameters for Customized, WES and MSKCC panel

Parameter   Customized panel (default)   WES panel   MSKCC panel
normalDP    10                           10          10
normalAD    Inf*                         1           1
tumorDP     20                           20          20
tumorAD     5                            5           10
VAF         0.05                         0.05        0.05
VAFratio    0                            0           5

*: Inf here means normalAD is not a filtration criterion in this panel

MAF_qual <- mutFilterQual(MAFdat, panel="Customized", VAF=0.01, VAFratio=4)
table(MAF_qual$CaTag)
## 
##  0 0Q 
## 30 57

Here we can see that 57 mutations get an extra Q flag, which means they fail to pass the filtration on sequencing quality with VAF < 0.01 or VAFratio < 4, or both.
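To make the thresholds in the table more concrete, here is a minimal base R sketch of depth- and VAF-based filtering on toy read counts. It is not the code inside mutFilterQual: the column names are hypothetical, VAFratio is assumed to mean tumor VAF divided by normal VAF, and normalAD is left out because it is not a criterion in the Customized panel.

```r
# Toy read counts; all column names here are hypothetical
reads <- data.frame(
  tumorAD  = c(12, 3, 8),    # alt-supporting reads in the tumor
  tumorDP  = c(60, 40, 15),  # total depth in the tumor
  normalAD = c(0, 0, 1),     # alt-supporting reads in the normal
  normalDP = c(30, 25, 12)   # total depth in the normal
)

# Thresholds taken from the "Customized" column of the table above
minTumorDP <- 20; minTumorAD <- 5; minNormalDP <- 10
minVAF <- 0.05; minVAFratio <- 0

tumorVAF  <- reads$tumorAD / reads$tumorDP
normalVAF <- reads$normalAD / reads$normalDP
# Assumed definition: tumor VAF over normal VAF (Inf if the normal has no alt reads)
vafRatio  <- tumorVAF / normalVAF

# A variant fails if any single criterion is not met
fail <- reads$tumorDP < minTumorDP | reads$tumorAD < minTumorAD |
  reads$normalDP < minNormalDP | tumorVAF < minVAF | vafRatio < minVAFratio
```

Here the second variant fails on tumorAD and the third on tumorDP, while the first one passes every criterion.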

 

Strand bias filtration

Strand bias occurs when the genotype inferred from the forward strand disagrees with the genotype inferred from the reverse strand. A study showed that post-analysis procedures can cause strand bias, which introduces more SNPs with higher strand bias and in turn results in more false-positive SNPs 1. Therefore, it is necessary to detect and minimize the effect of strand bias.

At present, there are four widely used methods for strand bias detection. One approach was described in a mitochondrial heteroplasmy study 2. GATK calculates a strand bias score for each identified SNP, while Samtools puts forward another strand bias score based on Fisher's exact test. Additionally, GATK introduced an updated form of the Fisher Strand Test, the StrandOddsRatio (SOR) annotation, which is believed to be better at measuring strand bias in high-coverage data.

In CaMutQC, either the Fisher Strand Test or the SOR algorithm can be used to evaluate strand bias and filter variants based on the results. By default, strand bias is detected through the SOR algorithm and the cutoff for the strand bias score is set to 3.

MAF_sb <- mutFilterSB(MAFdat, SBscore=2)
table(MAF_sb$CaTag)
## 
##  0 0S 
## 68 19

In our case, 19 mutations are labeled by S flag because CaMutQC believes they have strand bias when the cutoff is set to 2.
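For readers unfamiliar with the Fisher-based approach mentioned above, the sketch below computes a Phred-scaled p-value from Fisher's exact test on a 2 x 2 table of reference/alternate reads by strand, using only base R. The function name fisher_strand and the read counts are made up for illustration, and the resulting score is on a different scale from the SOR score that SBscore refers to by default, so the two cutoffs are not interchangeable.

```r
# Phred-scaled p-value of Fisher's exact test on a 2 x 2 strand table.
# fisher_strand is a hypothetical helper, not a CaMutQC function.
fisher_strand <- function(ref_fwd, ref_rev, alt_fwd, alt_rev) {
  tab <- matrix(c(ref_fwd, ref_rev, alt_fwd, alt_rev), nrow = 2, byrow = TRUE,
                dimnames = list(c("ref", "alt"), c("forward", "reverse")))
  p <- fisher.test(tab)$p.value
  # Guard against floating-point p-values marginally above 1
  -10 * log10(min(p, 1))
}

# Alt reads evenly split between strands: no evidence of bias
fs_balanced <- fisher_strand(50, 50, 10, 10)

# All alt reads on the forward strand: strong evidence of bias
fs_biased <- fisher_strand(50, 50, 20, 0)
```

A higher score means the alternate allele is distributed across strands less like the reference allele.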

 

Adjacent indel filtration

The Adjacent Indel tag is used when a somatic SNP/DNP/TNP was possibly caused by misalignment around a germline or somatic insertion/deletion (indel). By default, CaMutQC filters any SNV within 10 bp of an indel with length <= 50 bp found in the tumor sample.

MAF_adj <- mutFilterAdj(MAFdat, maxIndelLen=40, minInterval=15)
table(MAF_adj$CaTag)
## 
##  0 0A 
## 85  2

There are 2 point mutations labeled by flag A in the above example, because they are within 15 bps of an indel with length <= 40.
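The idea behind this filter can be sketched in base R as below, using the default settings on toy variants. This is only an illustration of the distance check, not the implementation of mutFilterAdj. As a simplifying assumption, indel length is derived from the coordinates, which is only meaningful for deletions.

```r
# Toy variants; column names follow the MAF columns shown earlier
vars <- data.frame(
  Chromosome     = c("chr1", "chr1", "chr1", "chr2"),
  Start_Position = c(100L, 108L, 500L, 108L),
  End_Position   = c(100L, 110L, 500L, 108L),
  Variant_Type   = c("SNP", "DEL", "SNP", "SNP")
)
minInterval <- 10  # maximum distance (bp) between a point mutation and an indel
maxIndelLen <- 50  # only indels up to this length are considered

# Assumption: length from coordinates, which only holds for deletions
is_indel  <- vars$Variant_Type %in% c("INS", "DEL")
indel_len <- vars$End_Position - vars$Start_Position + 1
indels    <- vars[is_indel & indel_len <= maxIndelLen, ]

# Flag point mutations lying within minInterval bp of a qualifying indel
near_indel <- vapply(seq_len(nrow(vars)), function(i) {
  if (is_indel[i]) return(FALSE)
  same_chr <- indels$Chromosome == vars$Chromosome[i]
  # Gap between the two intervals (0 if they overlap)
  gap <- pmax(indels$Start_Position - vars$End_Position[i],
              vars$Start_Position[i] - indels$End_Position, 0)
  any(same_chr & gap <= minInterval)
}, logical(1))
```

Only the first variant is flagged: it sits 8 bp from the chr1 deletion, whereas the chr2 SNP shares the coordinate but not the chromosome.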

 

Normal depth filtration

To avoid miscalling germline variants and to improve the quality of variants 3, CaMutQC supports filtration on normal depth for both dbSNP and non-dbSNP variants, with default cutoffs of 19 and 8 respectively.

MAF_normaldp <- mutFilterNormalDP(MAFdat, dbsnpCutoff=19, nonCutoff=8)
table(MAF_normaldp$CaTag)
## 
##  0 
## 87

Based on the results, all mutations pass this filtration under default settings.

 

Panel of Normals filtration

Panel of Normals (PON) is a type of resource used in somatic variant analysis. Basically, if a variant is found in a panel of normals, or is found in more than two normal samples, it is unlikely to be a driver variant in tumorigenesis or tumor development. PON filtration has been widely used in many studies and projects to discard non-driver variants 4 5 6.

A PON data set can be generated by users through sequencing a number of normal samples that are as technically similar as possible to the tumor (same exome or genome preparation methods, sequencing technology and so on). Or, the PON data set can also be directly obtained from GATK, which is viewed as one of the most effective filters for false-positive, contamination, and germline variants 3.

Due to potential copyright issues, PON files are NOT contained in CaMutQC package. But we recommend public GATK panels of normals data as PON files, and they can be easily accessed from GATK resource bundle:

GRCh38: gs://gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz

GRCh37: gs://gatk-best-practices/somatic-b37/Mutect2-exome-panel.vcf

MAF_pon <- mutFilterPON(MAFdat, 
                        PONfile=system.file("extdata", "PON_test.txt", 
                                            package="CaMutQC"), PONformat="txt")
table(MAF_pon$CaTag)
## 
##  0 0P 
## 86  1

Here, we use a random PON file as an example to display how this function works, and 1 mutation is found in the PON file, and thus labeled by P flag.
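Conceptually, PON filtration is a lookup of each variant in the panel. The base R sketch below shows one way to do such a lookup with a position-and-allele key. The PON table and its column names (CHROM, POS, REF, ALT) are hypothetical, and this is not necessarily how mutFilterPON parses or matches PON files.

```r
# Toy MAF-like variants
maf <- data.frame(
  Chromosome        = c("chr1", "chr2", "chr7"),
  Start_Position    = c(1049980L, 241190240L, 73469636L),
  Reference_Allele  = c("G", "C", "G"),
  Tumor_Seq_Allele2 = c("C", "T", "C")
)

# Hypothetical PON table with VCF-style column names
pon <- data.frame(
  CHROM = c("chr2", "chr9"),
  POS   = c(241190240L, 5000L),
  REF   = c("C", "A"),
  ALT   = c("T", "G")
)

# Build a chromosome:position:ref:alt key for exact matching
make_key <- function(chr, pos, ref, alt) paste(chr, pos, ref, alt, sep = ":")

maf_key <- make_key(maf$Chromosome, maf$Start_Position,
                    maf$Reference_Allele, maf$Tumor_Seq_Allele2)
pon_key <- make_key(pon$CHROM, pon$POS, pon$REF, pon$ALT)

# TRUE for variants that are present in the panel of normals
in_pon <- maf_key %in% pon_key
```

Exact key matching like this works for SNVs. Indels usually need their representation normalized first, which is outside the scope of this sketch.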

 

Database filtration

Some databases publish germline variants and recurrent artifacts observed in distinct populations. In CaMutQC, based on the parameters we collected 3 4 7, potential germline variants are removed based on annotation from those databases (if available), unless the allele frequency of a mutation recorded in those databases is lower than the VAF threshold (0.01) or ClinVar/OMIM/HGMD flags it as pathogenic.

COSMIC (the Catalogue of Somatic Mutations In Cancer) is the most comprehensive resource for exploring the impact of somatic mutations in oncology. The team has assembled a list of genes that are somatically mutated and causally implicated in human cancer 8, which is called the Cancer Gene Census and is updated periodically with new genes. In a VCF file annotated by VEP, the Existing_variation column indicates that a variant is recorded in COSMIC if it has an annotation ID starting with COSV, COSM or COSN.

By default, CaMutQC filters variants recorded in ExAC, Genomesprojects1000, ESP6500 and gnomAD, and always keeps variants in COSMIC regardless of whether they are present in any germline database.

# labels can be added
MAF_db <- mutFilterDB(MAFdat, dbSNP=TRUE, dbVAF=0.01)
table(MAF_db$CaTag)
## 
##  0 0D 
## 46 41

We can see from the results that 41 mutations are labeled by D flag when we set the database VAF cutoff as 0.01 and filter mutations in the dbSNP database, much more than the mutations labeled in previous steps. Since this function is a part of the candidate variant selection process, more mutations might be labeled due to strict conditions and thresholds.
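The decision rule described in this section can be summarized with a small base R sketch on toy annotations: a variant is flagged when its population allele frequency reaches the cutoff, unless it carries a COSMIC identifier. The column name gnomAD_AF, the identifiers and the frequencies are invented for illustration, and the pathogenicity exception (ClinVar/OMIM/HGMD) is not modelled here.

```r
# Toy annotations; gnomAD_AF is a hypothetical column name and all IDs are made up
ann <- data.frame(
  gnomAD_AF          = c(0.20, 0.001, NA, 0.15),
  Existing_variation = c("rs123", "rs456", "", "rs789&COSV100"),
  stringsAsFactors   = FALSE
)
dbVAF <- 0.01  # allele frequency cutoff in the germline database

# COSMIC records are recognised by their ID prefix (COSV, COSM or COSN)
in_cosmic <- grepl("COS[VMN]", ann$Existing_variation)

# Common in the population: recorded with a frequency at or above the cutoff
common <- !is.na(ann$gnomAD_AF) & ann$gnomAD_AF >= dbVAF

# Flag common variants unless COSMIC rescues them
flag_D <- common & !in_cosmic
```

Only the first variant is flagged. The second is too rare in the database, the third is not recorded at all, and the fourth is common but kept because of its COSMIC ID.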

 

Variant type filtration

Most studies related to cancer somatic mutations keep only certain types of variants in order to better target candidate variants, among which exonic and nonsynonymous are two of the most widely used categories for filtration 3 9 10.

In CaMutQC, either of these two categories can be chosen in this step. exonic is the default option, while nonsynonymous keeps only non-synonymous variants. More details can be found at Ensembl Variation.

  • Variant classifications filtered when set as exonic: RNA, Intron, IGR, 5'Flank, 3'Flank, 5'UTR, 3'UTR

  • Variant classifications filtered when set as nonsynonymous: 3'UTR, 5'UTR, 3'Flank, Targeted_Region, Silent, Intron, RNA, IGR, Splice_Region, 5'Flank, lincRNA, De_novo_Start_InFrame, De_novo_Start_OutOfFrame, Start_Codon_Ins, Start_Codon_SNP, Stop_Codon_Del

MAF_type <- mutFilterType(MAFdat, keepType='nonsynonymous')
table(MAF_type$CaTag)
## 
##  0 0T 
## 12 75
table(MAF_type$Variant_Classification[which(MAF_type$CaTag == '0')])
## 
##      In_Frame_Del Missense_Mutation 
##                 1                11

75 mutations that are not nonsynonymous (for example silent, intronic or UTR variants) are labeled in this step, and the remaining nonsynonymous mutations are more likely to be related to cancer development and progression.
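The two lists above translate directly into a membership test on the Variant_Classification column. The base R sketch below applies both lists to a few toy classifications; it illustrates the rule only and is not the body of mutFilterType.

```r
# Toy values of the Variant_Classification column
classification <- c("Missense_Mutation", "3'UTR", "Intron", "Silent", "In_Frame_Del")

# Classifications filtered under each option, copied from the lists above
exonic_drop <- c("RNA", "Intron", "IGR", "5'Flank", "3'Flank", "5'UTR", "3'UTR")
nonsyn_drop <- c("3'UTR", "5'UTR", "3'Flank", "Targeted_Region", "Silent",
                 "Intron", "RNA", "IGR", "Splice_Region", "5'Flank", "lincRNA",
                 "De_novo_Start_InFrame", "De_novo_Start_OutOfFrame",
                 "Start_Codon_Ins", "Start_Codon_SNP", "Stop_Codon_Del")

# TRUE means the variant would receive the T flag under that option
flag_exonic <- classification %in% exonic_drop
flag_nonsyn <- classification %in% nonsyn_drop
```

The Silent variant shows the difference between the two options: it survives exonic but is flagged under nonsynonymous.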

 

Region selection

In this step, users are able to further select variants related to cancer development by providing an additional BED file (or a .rds file with a bed variable in it), and variants will be searched only in target regions covered in the BED file. Besides, parameter bedFilter can be set as TRUE to clean the bed file (only leaves segments in Chr1-Chr22, ChrX and ChrY).

MAF_reg <- mutFilterReg(MAFdat, 
                        bedFile=system.file("extdata/bed/panel_hg19", 
                                            "FlCDx-hg19.rds", package="CaMutQC"))
table(MAF_reg$CaTag)
## 
## 0R 
## 87

No mutation is within the target region provided in this case, so all mutations get an R flag.
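Region selection boils down to an interval lookup. The base R sketch below checks toy positions against toy BED intervals, assuming the usual BED convention of 0-based, half-open coordinates and 1-based MAF positions. How mutFilterReg handles coordinates internally is not shown in this document, so treat the comparison as an assumption.

```r
# Toy BED intervals (0-based start, end not included in 0-based terms)
bed <- data.frame(
  chrom = c("chr1", "chr2"),
  start = c(1049000L, 500L),
  end   = c(1050000L, 900L)
)

# Toy 1-based variant positions
pos <- data.frame(
  Chromosome     = c("chr1", "chr2", "chr3"),
  Start_Position = c(1049980L, 1000L, 42L)
)

# A 1-based position p lies in a BED interval when start < p <= end
in_region <- vapply(seq_len(nrow(pos)), function(i) {
  p <- pos$Start_Position[i]
  any(bed$chrom == pos$Chromosome[i] & bed$start < p & p <= bed$end)
}, logical(1))

# Variants outside every target region would receive the R flag
flag_R <- !in_region
```

If the BED file and the MAF use different genome builds, no positions will match, so it is worth checking that both refer to the same reference assembly.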

 

Overall filtration

The sub-functions mentioned above are divided into two groups according to their definitions and the categories they belong to, which can be reached through the advanced functions mutFilterTech and mutSelection respectively. Each advanced function is composed of multiple sub-functions that filter variants from different aspects within the same category. After passing through an advanced filter function, each variant may be labeled with more than one flag showing the filtration results.

In addition, mutFilterCom is a higher-level function that combines mutFilterTech and mutSelection, so any parameter of the sub-functions can be set in mutFilterCom.

 

Potential artifacts filtration

Function mutFilterTech combines filtration strategies for removing potential artifacts, including sequencing quality, strand bias, normal DP, PON and adjacent indel filtration.

Some variant callers add a tag if a variant passes the post-filtration after calling. With CaMutQC, users can set a standard tag found in the FILTER column of the VCF file to keep variants. PASS is set as the default tag.

MAF_tech <- mutFilterTech(MAFdat, panel="Customized", tumorDP=8, minInterval=9, 
                          tagFILTER=NULL, progressbar=FALSE, 
                          PONfile=system.file("extdata", "PON_test.txt", 
                                              package="CaMutQC"), PONformat="txt")
table(MAF_tech$CaTag)
## 
##   0  0A  0P  0Q 0QA  0S 
##  46   1   1  32   1   6

There are 41 mutations labeled by mutFilterTech in the above example, and 1 mutation has 2 flags, suggesting it is more likely to be a false positive under current settings.

 

Candidate variant selection

In most cases, basic filtration that removes potential artifacts is not enough for selecting candidate variants that participate in the formation and development of tumors, because a number of germline variants, or variants that do not influence phenotype, still remain in the data set. Therefore, candidate variant selection is a necessary step for downstream analyses.

The whole selection process in CaMutQC is composed of database filtration, variant type filtration and region selection, all incorporated in the mutSelection function.

MAF_selec <- mutSelection(MAFdat, dbVAF=0.02, keepType='nonsynonymous', progressbar=FALSE)
table(MAF_selec$CaTag)
## 
##   0 0DT  0T 
##  12   7  68

12 mutations are selected as candidates by mutSelection after filtering synonymous mutations and mutations with VAF >= 0.02 in databases.

 

Combined function: mutFilterCom

A main function of CaMutQC is mutFilterCom, which integrates all sub-functions into one function. It also includes other features that make CaMutQC an interactive and powerful tool. For example, you can export the code, along with the parameters you set, by turning on the codelog setting and specifying codelogFile.

MAFCom <- mutFilterCom(MAFdat, panel="WES", report=FALSE, TMB=FALSE, progressbar=FALSE,
                       PONfile=system.file("extdata", "PON_test.txt", 
                                           package="CaMutQC"), PONformat="txt")
table(MAFCom$CaTag)
## 
##     0    0Q  0QAT  0QDT  0QPT 0QSDT  0QST   0QT    0T 
##    10    12     2     6     1     1     5    33    17

mutFilterCom function is the combination of mutFilterTech and mutSelection, which labels 77 mutations in our case. The results above clearly show the status of each mutation, offering users much information for further filtration and analyses.
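The combined tags can be broken down per flag with a few lines of base R. The sketch below re-enters the counts printed above by hand (so it runs without the package) and tallies how many mutations carry each flag; with a real result, table(MAFCom$CaTag) could be used in place of the hand-typed vector.

```r
# Counts copied from the table(MAFCom$CaTag) output above
tag_counts <- c("0" = 10, "0Q" = 12, "0QAT" = 2, "0QDT" = 6, "0QPT" = 1,
                "0QSDT" = 1, "0QST" = 5, "0QT" = 33, "0T" = 17)

# Mutations carrying at least one flag
labeled <- sum(tag_counts[names(tag_counts) != "0"])

# Number of mutations carrying each individual flag
flags <- c("Q", "S", "A", "N", "P", "F", "D", "T", "R")
per_flag <- vapply(flags, function(f) {
  sum(tag_counts[grepl(f, names(tag_counts), fixed = TRUE)])
}, numeric(1))

per_flag
```

This makes it easy to see which filter is responsible for most of the labeled mutations before deciding whether a threshold should be relaxed.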

 

Filter report

By default, a vivid and detailed filter report will be saved automatically each time after running mutFilterCom. An example filter report can be found here.

 

TMB calculation

mutFilterCom also supports the calculation of TMB. Details about TMB can be found in the Mutational analysis section.

MAFCom_tmb <- mutFilterCom(MAFdat, panel="WES", assay="Customized", report=FALSE, TMB=TRUE, 
                           bedFile=system.file("extdata/bed/panel_hg38", 
                                               "Pan-cancer-hg38.rds", package="CaMutQC"), 
                           PONfile=system.file("extdata", "PON_test.txt", package="CaMutQC"), 
                           PONformat="txt", progressbar=FALSE, verbose=FALSE)
## Warning in calTMB(maf, bedFile = bedFile, assay = assay, genelist = genelist, : Bed files in CaMutQC are not accurate. The result serves only as a reference.
## Method used to calculate TMB: Customized.
## Estimated TMB is: 0.847.

When running mutFilterCom, mutFilterTech or mutSelection, a progress bar and some messages will display by default to notify users how the task goes, as well as some potential issues. Users can turn off the message by setting verbose=FALSE. When TMB=TRUE, the TMB will be calculated using a specific assay and printed out on the screen. TMB is 0.847 in this case.

 

Customized filtration

Cancer type-based filtration

With CaMutQC, users are able to filter and select cancer somatic mutations according to cancer types, where thresholds for parameters all come from classical studies. mutFilterCan function integrates 10 cancer types so far, with different parameters for each cancer type, for a more precise and customized filtration.

Cancer types supported in CaMutQC: COADREAD, BRCA, LIHC, LAML, LCML, UCEC, UCS, BLCA, KIRC, KIRP.

MAFCan <- mutFilterCan(MAFdat, cancerType='LAML', report=FALSE, TMB=FALSE, 
                       progressbar=FALSE, 
                       PONfile=system.file("extdata", "PON_test.txt", 
                                           package="CaMutQC"), PONformat="txt")
table(MAFCan$CaTag)
## 
##   0  0D 0PD  0Q 
##  33  48   1   5

After applying the filtering strategies of Acute myeloid leukemia (LAML), 33 out of 87 mutations are kept.

 

Reference-based filtration

Sometimes, we may want to apply the same set of strategies used in another study, so that our results are comparable with it. So far, filtering strategies used in five studies are provided in CaMutQC. By passing one of the references in the correct format to the mutFilterRef function, all filtering strategies in that study will be applied automatically to your data.

MAFRef <- mutFilterRef(MAFdat, reference="Zhu_et_al-Nat_Commun-2020-KIRP", 
                       report=FALSE, TMB=FALSE, progressbar=FALSE,
                       PONfile=system.file("extdata", "PON_test.txt", 
                                           package="CaMutQC"), PONformat="txt")
table(MAFRef$CaTag)
## 
##   0  0D 0PD  0Q 0QD 
##  34  30   1  12  10

After applying the same strategies used in Zhu_et_al-Nat_Commun-2020-KIRP, 34 mutations are left without any flag.

Mutational analysis

Tumor Mutational Burden (TMB) refers to the number of somatic non-synonymous mutations per megabase pair (Mb) in a specific genomic region. In 2015, tumor non-synonymous mutation burden was first confirmed to be related to PD1/PD-L1 cancer immunotherapy 11. By analyzing the mutation burden of patients with non-small cell lung cancer together with clinical response, survival rate and other indicators, researchers confirmed that the higher a patient's TMB, the better the effect of tumor immunotherapy. This conclusion was subsequently verified in other cancer types such as malignant melanoma 12 and small cell lung cancer 13. Therefore, TMB has become one of the predictive biomarkers for immune checkpoint inhibitor immunotherapy in cancer treatment 14.

There are many assays for TMB measurement, including WGS, WES, targeted sequencing using gene panels, and sequencing of circulating tumor DNA in tumor samples or blood 15. Unlike in scientific research, the conventional method of calculating TMB in clinical practice is targeted sequencing of tumor samples, that is, hybridizing and capturing the exon and intron regions of a certain number of cancer-related genes, without the need for WES. Currently, the most widely used panels are FoundationOneCDx (F1CDx) and MSK-IMPACT 9. The former only needs tumor samples to be sequenced, while the latter requires both the tumor sample and its matched normal sample to be sequenced. Both of them are certified by the US Food and Drug Administration (FDA).

CaMutQC supports four assays for TMB calculation, including FoundationOne, MSK-IMPACT (3 versions of genelist), Pan-cancer panel 16 and WES. By default, TMB is calculated using MSK-IMPACT method (gene panel version 3, 468 genes). Also, users are free to apply their own methods by setting parameter assay as Customized.

Note: the bed region files mentioned above are generated only from CDS regions, NOT the exact bed region, so the TMB results are only for reference.

tmb_value <- calTMB(MAFdat, assay='Customized', 
                    bedFile=system.file("extdata/bed/panel_hg38","Pan-cancer-hg38.rds", 
                                        package="CaMutQC"))
## Warning in calTMB(MAFdat, assay = "Customized", bedFile = system.file("extdata/bed/panel_hg38", : Bed files in CaMutQC are not accurate. The result serves only as a reference.
tmb_value
## [1] 0.847

TMB value estimated by CaMutQC for this random MAF is 0.847. This is only an example case so it does not have any clinical meaning to be interpreted, but yours may have.
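Following the definition at the start of this section, TMB is a mutation count divided by the size of the covered region in Mb. The base R sketch below works through that arithmetic on toy numbers. It does not reproduce the 0.847 reported above, it assumes the BED intervals do not overlap, and the assay-specific rules applied by calTMB are not modelled.

```r
# Toy target regions in BED convention (0-based start, half-open)
bed <- data.frame(
  chrom = c("chr1", "chr2"),
  start = c(0L, 1000000L),
  end   = c(1500000L, 1500000L)
)

# Covered region size in megabases, assuming non-overlapping intervals
region_mb <- sum(bed$end - bed$start) / 1e6

# Hypothetical number of non-synonymous somatic mutations inside the regions
n_nonsyn <- 5

# TMB = mutations per megabase
tmb <- n_nonsyn / region_mb
tmb
```

Because the denominator depends entirely on the region file, the same mutation list can give rather different TMB values under different panels.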

 

Union strategy

After verification on published data sets, we believe that combining CaMutQC-filtered mutations from multiple variant callers is a good way to reduce the bias of a single mutation caller while rescuing potential false-negative mutations. In this pipeline, the same data set processed by three variant callers (MuSE, MuTect2 and VarScan2) first goes through CaMutQC filtration separately, and labeled mutations are removed. The processMut function then takes the three MAF data frames and returns the union of mutations. processMut can also return the intersection of the MAFs when asked.

maf_MuSE <- vcfToMAF(system.file("extdata/Multi-caller", 
                                 "WES_EA_T_1.MuSE.vep.vcf", package="CaMutQC")) 
maf_MuSE_f <- mutFilterCom(maf_MuSE, report=FALSE, TMB=FALSE, 
                           PONfile=system.file("extdata", 
                                               "PON_test.txt", package="CaMutQC"), 
                           PONformat = "txt", progressbar=FALSE)
maf_VarScan2 <- vcfToMAF(system.file("extdata/Multi-caller", 
                                     "WES_EA_T_1_varscan_filter_snp.vep.vcf", package="CaMutQC"))
maf_VarScan2_f <- mutFilterCom(maf_VarScan2, report=FALSE, TMB=FALSE, 
                               PONfile=system.file("extdata", 
                                                   "PON_test.txt", package="CaMutQC"), 
                               PONformat="txt", progressbar=FALSE)
MAFdat_f <- mutFilterCom(MAFdat, report=FALSE, TMB=FALSE, 
                         PONfile=system.file("extdata", "PON_test.txt", package= "CaMutQC"), 
                         PONformat="txt", progressbar=FALSE)
mafs <- list(maf_MuSE_f, maf_VarScan2_f, MAFdat_f)
maf_union <- processMut(mafs, processMethod = "union")
maf_union
##    Hugo_Symbol NCBI_Build Chromosome Start_Position End_Position
## 1      C1QTNF2     GRCh38       chr5      160354994    160354994
## 2       PTPN13     GRCh38       chr4       86769842     86769842
## 3       SLC6A7     GRCh38       chr5      150204585    150204585
## 4        BAZ1B     GRCh38       chr7       73469636     73469636
## 5       DNAJA3     GRCh38      chr16        4454855      4454855
## 6      MYBBP1A     GRCh38      chr17        4545725      4545725
## 7        KIF1C     GRCh38      chr17        5007047      5007047
## 8       SH2D3A     GRCh38      chr19        6755018      6755018
## 9         AGRN     GRCh38       chr1        1049980      1049980
## 10       MUC20     GRCh38       chr3      195725999    195725999
## 11       WDR17     GRCh38       chr4      176120051    176120052
## 12     COL22A1     GRCh38       chr8      138737577    138737577
## 13       CCDC7     GRCh38      chr10       32567833     32567833
## 14       FMNL3     GRCh38      chr12       49649096     49649096
## 15        NEMF     GRCh38      chr14       49840710     49840710
## 16      L2HGDH     GRCh38      chr14       50246124     50246125
## 17      CCDC33     GRCh38      chr15       74280669     74280669
## 18         MPG     GRCh38      chr16          79675        79675
## 19      RNF157     GRCh38      chr17       76212499     76212500
## 20     SLC2A11     GRCh38      chr22       23875150     23875150
##    Variant_Classification Variant_Type Reference_Allele
## 1                  Silent          SNP                G
## 2                  Silent          SNP                T
## 3                  Silent          SNP                G
## 4       Nonsense_Mutation          SNP                G
## 5       Nonsense_Mutation          SNP                G
## 6       Missense_Mutation          SNP                T
## 7       Missense_Mutation          SNP                C
## 8       Missense_Mutation          SNP                T
## 9       Missense_Mutation          SNP                G
## 10      Missense_Mutation          SNP                G
## 11      Missense_Mutation          DNP               AG
## 12      Missense_Mutation          SNP                C
## 13      Missense_Mutation          SNP                C
## 14                 Silent          SNP                C
## 15          Splice_Region          SNP                C
## 16          Splice_Region          INS                -
## 17      Missense_Mutation          SNP                G
## 18      Missense_Mutation          SNP                C
## 19        Targeted_Region          INS                -
## 20                 Silent          SNP                G
##                                  Tumor_Seq_Allele2
## 1                                                A
## 2                                                C
## 3                                                C
## 4                                                C
## 5                                                T
## 6                                                C
## 7                                                T
## 8                                                C
## 9                                                C
## 10                                               A
## 11                                              GA
## 12                                               A
## 13                                               T
## 14                                               T
## 15                                               G
## 16                                               A
## 17                                               T
## 18                                               A
## 19 TCCTGACCTCAGGTGATCCATCCGCCTCGGCCTCCCAAAGTGCTGGG
## 20                                               A

Here, three datasets are first converted from VCF to MAF, then filtered by mutFilterCom, and finally merged by union. Because even the same mutation has different depths, VAFs, etc. in different datasets, only 7 columns are kept after taking the union, as displayed above.
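The difference between union and intersection can be illustrated without the package. The base R sketch below combines three toy caller outputs on a set of key columns. Which columns processMut actually uses to decide that two mutations are the same is an assumption here, and the toy rows are for illustration only.

```r
# Assumed key columns that identify a mutation across callers
key_cols <- c("Chromosome", "Start_Position", "Reference_Allele", "Tumor_Seq_Allele2")

# Toy filtered outputs from three hypothetical callers
caller_a <- data.frame(Chromosome = c("chr1", "chr5"),
                       Start_Position = c(1049980L, 160354994L),
                       Reference_Allele = c("G", "G"),
                       Tumor_Seq_Allele2 = c("C", "A"))
caller_b <- data.frame(Chromosome = c("chr5", "chr4"),
                       Start_Position = c(160354994L, 86769842L),
                       Reference_Allele = c("G", "T"),
                       Tumor_Seq_Allele2 = c("A", "C"))
caller_c <- data.frame(Chromosome = "chr5",
                       Start_Position = 160354994L,
                       Reference_Allele = "G",
                       Tumor_Seq_Allele2 = "A")

# Keep only the key columns, one row per distinct mutation
keep <- lapply(list(caller_a, caller_b, caller_c),
               function(m) unique(m[, key_cols]))

# Union: every mutation reported by at least one caller
maf_union <- unique(do.call(rbind, keep))

# Intersection: mutations reported by all callers
maf_intersect <- Reduce(function(x, y) merge(x, y, by = key_cols), keep)
```

The union keeps three mutations while the intersection keeps only the one shared by all callers, which is why the union can rescue calls that a single caller misses.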

 

Call strategy set by your name

Tired of finding or memorizing the best parameters? You can share your own filtration strategies or parameter sets with the CaMutQC community by opening a new issue with a parameter set label. Every six months, CaMutQC will be updated to include top-rated parameter sets in the mutFilterRef function, named after the author's GitHub username. Start using, sharing and contributing NOW!

 

SessionInfo

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CaMutQC_1.3.0    BiocStyle_2.33.1
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      shape_1.4.6.1           sys_3.4.3              
##   [4] jsonlite_1.8.9          magrittr_2.0.3          ggtangle_0.0.3         
##   [7] farver_2.1.2            rmarkdown_2.28          GlobalOptions_0.1.2    
##  [10] fs_1.6.4                zlibbioc_1.51.2         vctrs_0.6.5            
##  [13] memoise_2.0.1           ggtree_3.13.2           htmltools_0.5.8.1      
##  [16] gridGraphics_0.5-1      pracma_2.4.4            sass_0.4.9             
##  [19] bslib_0.8.0             htmlwidgets_1.6.4       plyr_1.8.9             
##  [22] cachem_1.1.0            buildtools_1.0.0        igraph_2.1.1           
##  [25] lifecycle_1.0.4         iterators_1.0.14        pkgconfig_2.0.3        
##  [28] Matrix_1.7-1            R6_2.5.1                fastmap_1.2.0          
##  [31] gson_0.1.0              clue_0.3-65             GenomeInfoDbData_1.2.13
##  [34] digest_0.6.37           aplot_0.2.3             enrichplot_1.25.5      
##  [37] colorspace_2.1-1        maftools_2.21.3         patchwork_1.3.0        
##  [40] AnnotationDbi_1.69.0    S4Vectors_0.43.2        RSQLite_2.3.7          
##  [43] org.Hs.eg.db_3.20.0     vegan_2.6-8             fansi_1.0.6            
##  [46] httr_1.4.7              mgcv_1.9-1              compiler_4.4.1         
##  [49] withr_3.0.2             bit64_4.5.2             doParallel_1.0.17      
##  [52] BiocParallel_1.39.0     DBI_1.2.3               R.utils_2.12.3         
##  [55] MASS_7.3-61             MesKit_1.15.0           rjson_0.2.23           
##  [58] DNAcopy_1.79.0          permute_0.9-7           tools_4.4.1            
##  [61] ape_5.8                 quadprog_1.5-8          R.oo_1.26.0            
##  [64] glue_1.8.0              nlme_3.1-166            GOSemSim_2.31.2        
##  [67] grid_4.4.1              cluster_2.1.6           reshape2_1.4.4         
##  [70] memuse_4.2-3            fgsea_1.31.6            generics_0.1.3         
##  [73] gtable_0.3.6            R.methodsS3_1.8.2       tidyr_1.3.1            
##  [76] pinfsc50_1.3.0          data.table_1.16.2       utf8_1.2.4             
##  [79] XVector_0.45.0          BiocGenerics_0.51.3     ggrepel_0.9.6          
##  [82] foreach_1.5.2           pillar_1.9.0            stringr_1.5.1          
##  [85] yulab.utils_0.1.7       circlize_0.4.16         splines_4.4.1          
##  [88] dplyr_1.1.4             treeio_1.29.2           lattice_0.22-6         
##  [91] survival_3.7-0          bit_4.5.0               tidyselect_1.2.1       
##  [94] GO.db_3.20.0            ComplexHeatmap_2.21.1   maketools_1.3.1        
##  [97] Biostrings_2.73.2       knitr_1.48              IRanges_2.39.2         
## [100] stats4_4.4.1            xfun_0.48               Biobase_2.65.1         
## [103] matrixStats_1.4.1       DT_0.33                 stringi_1.8.4          
## [106] UCSC.utils_1.1.0        lazyeval_0.2.2          ggfun_0.1.7            
## [109] yaml_2.3.10             evaluate_1.0.1          codetools_0.2-20       
## [112] tibble_3.2.1            qvalue_2.37.0           BiocManager_1.30.25    
## [115] ggplotify_0.1.2         cli_3.6.3               munsell_0.5.1          
## [118] jquerylib_0.1.4         Rcpp_1.0.13             GenomeInfoDb_1.41.2    
## [121] vcfR_1.15.0             png_0.1-8               parallel_4.4.1         
## [124] ggplot2_3.5.1           blob_1.2.4              mclust_6.1.1           
## [127] clusterProfiler_4.13.4  DOSE_3.99.1             phangorn_2.12.1        
## [130] viridisLite_0.4.2       tidytree_0.4.6          ggridges_0.5.6         
## [133] scales_1.3.0            purrr_1.0.2             crayon_1.5.3           
## [136] GetoptLong_1.0.5        rlang_1.1.4             cowplot_1.1.3          
## [139] fastmatch_1.1-4         KEGGREST_1.45.1

 

References

  1. Guo Y, Li J, Li CI, Long J, Samuels DC, Shyr Y. The effect of strand bias in Illumina short-read sequencing data. BMC Genomics. 2012;13:666. Published 2012 Nov 24. doi:10.1186/1471-2164-13-666

  2. Guo Y, Cai Q, Samuels DC, et al. The use of next generation sequencing technology to study the effect of radiation therapy on mitochondrial DNA mutation. Mutat Res. 2012;744(2):154-160. doi:10.1016/j.mrgentox.2012.02.006

  3. Ellrott K, Bailey MH, Saksena G, et al. Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst. 2018;6(3):271-281.e7. doi:10.1016/j.cels.2018.03.002

  4. Pereira B, Chin SF, Rueda OM, et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat Commun. 2016;7:11479. Published 2016 May 10. doi:10.1038/ncomms11479

  5. Brastianos PK, Carter SL, Santagata S, et al. Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets. Cancer Discov. 2015;5(11):1164-1177. doi:10.1158/2159-8290.CD-15-0369

  6. Sethi NS, Kikuchi O, Duronio GN, et al. Early TP53 alterations engage environmental exposures to promote gastric premalignancy in an integrative mouse model. Nat Genet. 2020;52(2):219-230. doi:10.1038/s41588-019-0574-9

  7. Xue R, Chen L, Zhang C, et al. Genomic and Transcriptomic Profiling of Combined Hepatocellular and Intrahepatic Cholangiocarcinoma Reveals Distinct Molecular Subtypes. Cancer Cell. 2019;35(6):932-947.e8. doi:10.1016/j.ccell.2019.04.007

  8. Futreal PA, Coin L, Marshall M, et al. A census of human cancer genes. Nat Rev Cancer. 2004;4(3):177-183. doi:10.1038/nrc1299

  9. Cheng DT, Mitchell TN, Zehir A, et al. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT): A Hybridization Capture-Based Next-Generation Sequencing Clinical Assay for Solid Tumor Molecular Oncology. J Mol Diagn. 2015;17(3):251-264. doi:10.1016/j.jmoldx.2014.12.006

  10. Sakamoto H, Attiyeh MA, Gerold JM, et al. The Evolutionary Origins of Recurrent Pancreatic Cancer. Cancer Discov. 2020;10(6):792-805. doi:10.1158/2159-8290.CD-19-1508

  11. Rizvi NA, Hellmann MD, Snyder A, et al. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer. Science. 2015;348(6230):124-128. doi:10.1126/science.aaa1348

  12. Snyder A, Makarov V, Merghoub T, et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma [published correction appears in N Engl J Med. 2018 Nov 29;379(22):2185]. N Engl J Med. 2014;371(23):2189-2199. doi:10.1056/NEJMoa1406498

  13. Hellmann MD, Callahan MK, Awad MM, et al. Tumor Mutational Burden and Efficacy of Nivolumab Monotherapy and in Combination with Ipilimumab in Small-Cell Lung Cancer [published correction appears in Cancer Cell. 2019 Feb 11;35(2):329]. Cancer Cell. 2018;33(5):853-861.e4. doi:10.1016/j.ccell.2018.04.001

  14. Lee M, Samstein RM, Valero C, Chan TA, Morris LGT. Tumor mutational burden as a predictive biomarker for checkpoint inhibitor immunotherapy. Hum Vaccin Immunother. 2020;16(1):112-115. doi:10.1080/21645515.2019.1631136

  15. Stenzinger A, Allen JD, Maas J, et al. Tumor mutational burden standardization initiatives: Recommendations for consistent tumor mutational burden assessment in clinical samples to guide immunotherapy treatment decisions. Genes Chromosomes Cancer. 2019;58(8):578-588. doi:10.1002/gcc.22733

  16. Xu Z, Dai J, Wang D, et al. Assessment of tumor mutation burden calculation from gene panel sequencing data. Onco Targets Ther. 2019;12:3401-3409. Published 2019 May 6. doi:10.2147/OTT.S196638