To install this package, start R (version “4.2”) and enter:
# Install via BioConductor
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("katdetectr")
# or the latest version
BiocManager::install("katdetectr", version = "devel")
# or from github
devtools::install_github("https://github.com/ErasmusMC-CCBC/katdetectr")
After installation, you can load the package with:
katdetectr
is an R package for the detection,
characterization and visualization of localized hypermutated regions,
often referred to as kataegis.
The general workflow of katdetectr
can be summarized as
follows:
Please see the Application
Note (under submission) for additional background and details of
katdetectr
. The application note also contains a section
regarding the performance of katdetectr
and other kataegis
detection packages: maftools,
ClusteredMutations,
SeqKat,
kataegis, and SigProfilerClusters.
We have made katdetectr
available on
BioConductor as this insures reliability, and operability on
common operating systems (Linux, Mac, and Windows). We designed
katdetectr
such that it fits well in the
BioConductor ecosystem which allows katdetectr
to
be used easily in combination with other BioConductor packages
and analysis pipelines.
Below, the katdetectr
workflow is performed in a
step-by-step manner on publicly-available datasets that are included
within this package.
Genomic variants from multiple common data-formats (VCF/MAF and
VRanges objects) can be imported into katdetectr
.
# Genomic variants stored within the VCF format.
pathToVCF <- system.file(package = "katdetectr", "extdata/CPTAC_Breast.vcf")
# Genomic variants stored within the MAF format.
pathToMAF <- system.file(package = "katdetectr", "extdata/APL_primary.maf")
# In addition, we can generate synthetic genomic variants including kataegis foci using generateSyntheticData().
# This functions returns a VRanges object.
syntheticData <- generateSyntheticData(nBackgroundVariants = 2500, nKataegisFoci = 1)
Using detectKataegis()
, we can employ changepoint
detection to detect distinct clusters of varying intermutation distance
(IMD), mutation rate and size.
Imported genomic variant data can contain either single or multiple
samples, in the latter case records can be aggregated by setting
aggregateRecords = TRUE
. Overlapping genomic variants
(e.g., an InDel and SNV) are reduced into a single record.
From the genomic variants data the IMD is calculated. Following,
changepoint analysis is performed on the IMD of the genomic variants
which results in segments. Lastly, a segment is labelled as kataegis
foci if the segment fits the following parameters:
minSizeKataegis = 6
and IMDcutoff = 1000
.
# Detect kataegis foci within the given VCF file.
kdVCF <- detectKataegis(genomicVariants = pathToVCF)
# # Detect kataegis foci within the given MAF file.
# As this file contains multiple samples, we set aggregateRecords = TRUE.
kdMAF <- detectKataegis(genomicVariants = pathToMAF, aggregateRecords = TRUE)
# Detect kataegis foci within the synthetic data.
kdSynthetic <- detectKataegis(genomicVariants = syntheticData)
All relevant input and subsequent results are stored within
KatDetect
objects. Using summary()
,
show()
and/or print()
, we can generate
overviews of these KatDetect
object(s).
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## Class "KatDetect" : KatDetect Object
## : S4 class containing 4 slots with names:
## kataegisFoci genomicVariants segments info
##
## Created on: Wed Oct 30 08:37:32 2024
## katdetectr version: 1.9.0
##
## summary:
## --------------------------------------------------------
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## --------------------------------------------------------
## Class "KatDetect" : KatDetect Object
## : S4 class containing 4 slots with names:
## kataegisFoci genomicVariants segments info
##
## Created on: Wed Oct 30 08:37:32 2024
## katdetectr version: 1.9.0
##
## summary:
## --------------------------------------------------------
## Sample name: CPTAC
## Total number of genomic variants: 3684
## Total number of putative Kataegis foci: 9
## Total number of variants in a Kataegis foci: 133
## --------------------------------------------------------
Underlying data can be retrieved from a KatDetect
objects using the following getter
functions:
getGenomicVariants()
returns: VRanges object. Processed
genomic variants used as input for changepoint detection. This VRanges
contains the genomic location, IMD, and kataegis status of each genomic
variantgetSegments()
returns: GRanges object. Contains the
segments as derived from changepoint detection. This Granges contains
the genomic location, total number of variants, mean IMD and, mutation
rate of each segment.getKataegisFoci()
returns: GRanges object. Contains all
segments designated as putative kataegis foci according the the
specified parameters (minSizeKataegis and IMDcutoff). This Granges
contains the genomic location, total number of variants and mean IMD of
each putative kataegis focigetInfo()
returns: List object. Contains supplementary
information including used parameter settings.## VRanges object with 3684 ranges and 5 metadata columns:
## seqnames ranges strand ref alt totalDepth
## <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
## [1] chr1 935222 * C A 50
## [2] chr1 949608 * G A 50
## [3] chr1 981131 * A G 50
## [4] chr1 982722 * A G 50
## [5] chr1 1164015 * C A 50
## ... ... ... ... ... ... ...
## [3680] chrX 153594977 * G A 50
## [3681] chrX 153627839 * C T 50
## [3682] chrX 153629155 * A G 50
## [3683] chrX 153668757 * G A 50
## [3684] chrX 153764217 * C T 50
## refDepth altDepth sampleNames softFilterMatrix | revmap
## <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <list>
## [1] 20 30 CPTAC | 1
## [2] 20 30 CPTAC | 2
## [3] 20 30 CPTAC | 3
## [4] 20 30 CPTAC | 4
## [5] 20 30 CPTAC | 5
## ... ... ... ... ... . ...
## [3680] 20 30 CPTAC | 3683
## [3681] 20 30 CPTAC | 3684
## [3682] 20 30 CPTAC | 3685
## [3683] 20 30 CPTAC | 3686
## [3684] 20 30 CPTAC | 3687
## variantID IMD segmentID putativeKataegis
## <integer> <integer> <integer> <logical>
## [1] 1 935222 1 FALSE
## [2] 2 14386 1 FALSE
## [3] 3 31523 1 FALSE
## [4] 4 1591 1 FALSE
## [5] 5 181293 1 FALSE
## ... ... ... ... ...
## [3680] 3680 442 5 FALSE
## [3681] 3681 32862 6 FALSE
## [3682] 3682 1316 6 FALSE
## [3683] 3683 39602 6 FALSE
## [3684] 3684 95460 7 FALSE
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
## hardFilters: NULL
## GRanges object with 452 ranges and 8 metadata columns:
## seqnames ranges strand | segmentID totalVariants
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 1-3389727 * | 1 11
## [2] chr1 3389728-3428608 * | 2 4
## [3] chr1 3428609-19199400 * | 3 22
## [4] chr1 19199401-19203725 * | 4 4
## [5] chr1 19203726-19635011 * | 5 8
## ... ... ... ... . ... ...
## [448] chrX 153577919-153594977 * | 5 6
## [449] chrX 153594978-153668757 * | 6 3
## [450] chrX 153668758-155270560 * | 7 1
## [451] chrY 1-59373566 * | 1 0
## [452] chrM 1-16571 * | 1 0
## firstVariantID lastVariantID meanIMD mutationRate sampleNames
## <integer> <integer> <numeric> <numeric> <character>
## [1] 1 11 308157.00 3.24510e-06 CPTAC
## [2] 12 15 9720.25 1.02878e-04 CPTAC
## [3] 16 37 716854.18 1.39498e-06 CPTAC
## [4] 38 41 1081.25 9.24855e-04 CPTAC
## [5] 42 49 53910.75 1.85492e-05 CPTAC
## ... ... ... ... ... ...
## [448] 3675 3680 2843.17 3.51720e-04 CPTAC
## [449] 3681 3683 24593.33 4.06614e-05 CPTAC
## [450] 3684 3684 1601802.00 6.24297e-07 CPTAC
## [451] <NA> <NA> NA 0.00000e+00 CPTAC
## [452] <NA> <NA> NA 0.00000e+00 CPTAC
## IMDcutoffValues
## <numeric>
## [1] 1000
## [2] 1000
## [3] 1000
## [4] 1000
## [5] 1000
## ... ...
## [448] 1000
## [449] 1000
## [450] 1000
## [451] 1000
## [452] 1000
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## GRanges object with 9 ranges and 7 metadata columns:
## seqnames ranges strand | fociID sampleNames totalVariants
## <Rle> <IRanges> <Rle> | <integer> <character> <numeric>
## [1] chr3 58108856-58111467 * | 1 CPTAC 7
## [2] chr6 32489708-32489949 * | 2 CPTAC 13
## [3] chr6 32632598-32632770 * | 3 CPTAC 8
## [4] chr6 151669875-151674326 * | 4 CPTAC 7
## [5] chr8 144991205-144999107 * | 5 CPTAC 25
## [6] chr11 62285208-62298597 * | 6 CPTAC 25
## [7] chr14 105405599-105419557 * | 7 CPTAC 23
## [8] chr15 86122654-86124712 * | 8 CPTAC 6
## [9] chr19 4510560-4513559 * | 9 CPTAC 19
## firstVariantID lastVariantID meanIMD IMDcutoff
## <numeric> <integer> <numeric> <numeric>
## [1] 782 788 435.1667 1000
## [2] 1251 1263 20.0833 1000
## [3] 1273 1280 24.5714 1000
## [4] 1358 1364 741.8333 1000
## [5] 1659 1683 329.2500 1000
## [6] 2112 2136 557.8750 1000
## [7] 2591 2613 634.4545 1000
## [8] 2687 2692 411.6000 1000
## [9] 3139 3157 166.6111 1000
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## $sampleName
## [1] "CPTAC"
##
## $totalGenomicVariants
## [1] 3684
##
## $totalKataegisFoci
## [1] 9
##
## $totalVariantsInKataegisFoci
## [1] 133
##
## $version
## [1] "1.9.0"
##
## $date
## [1] "Wed Oct 30 08:37:32 2024"
##
## $parameters
## $parameters$minSizeKataegis
## [1] 6
##
## $parameters$IMDcutoff
## [1] 1000
##
## $parameters$test.stat
## [1] "Exponential"
##
## $parameters$penalty
## [1] "BIC"
##
## $parameters$pen.value
## [1] 0
##
## $parameters$method
## [1] "PELT"
##
## $parameters$minseglen
## [1] 2
##
## $parameters$aggregateRecords
## [1] FALSE
Per sample, we can visualize the IMD, detected segments and putative kataegis foci as a rainfall plot. In addition, this allows for a per-chromosome approach which can highlight the putative kataegis foci.
As show previously, the IMD cutoff value can be set by the user in
order to identify different types of mutation clusters. However, it is
also possible to define a custom function that determines a IMD cutoff
value for each detected segment. This flexibility allow you to define
your own kataegis definition while still using the
katdetectr
framework. Below we show how to implement a
custom function for the IMD cutoff value. The function below comes from
the work of Pan-Cancer
Analysis of Whole Genomes Consortium
# function for modeling sample mutation rate
modelSampleRate <- function(IMDs) {
lambda <- log(2) / median(IMDs)
return(lambda)
}
# function for calculating the nth root of x
nthroot <- function(x, n) {
y <- x^(1 / n)
return(y)
}
# Function that defines the IMD cutoff specific for each segment
# Within this function you can use all variables available in the slots: genomicVariants and segments
IMDcutoffFun <- function(genomicVariants, segments) {
IMDs <- genomicVariants$IMD
totalVariants <- segments$totalVariants
width <- segments |>
dplyr::as_tibble() |>
dplyr::pull(width)
sampleRate <- modelSampleRate(IMDs)
IMDcutoff <- -log(1 - nthroot(0.01 / width, ifelse(totalVariants != 0, totalVariants - 1, 1))) / sampleRate
IMDcutoff <- replace(IMDcutoff, IMDcutoff > 1000, 1000)
return(IMDcutoff)
}
kdCustom <- detectKataegis(syntheticData, IMDcutoff = IMDcutoffFun)
The human autosomes and sex chromosomes from the reference genome
hg19 and hg38 come implemented in katdetectr
. However,
other (non standard) sequences can be analyzed if the correct arguments
are provided. You have to provide a dataframe that contains the length
of all the sequences you want to analyze.
# generate data that contains non standard sequences
syndata1 <- generateSyntheticData(seqnames = c("chr1_gl000191_random", "chr4_ctg9_hap1"))
syndata2 <- generateSyntheticData(seqnames = "chr1")
syndata <- suppressWarnings(c(syndata1, syndata2))
# construct a dataframe that contains the length of the sequences
# each column name (name of the sequence)
sequenceLength <- data.frame(
chr1 = 249250621,
chr1_gl000191_random = 106433,
chr4_ctg9_hap1 = 590426
)
# provide the dataframe with the sequence lengths using the refSeq argument
kdNonStandard <- detectKataegis(genomicVariants = syndata, refSeq = sequenceLength)
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] katdetectr_1.9.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Rdpack_2.6.1 DBI_1.2.3
## [3] bitops_1.0-9 rlang_1.1.4
## [5] magrittr_2.0.3 matrixStats_1.4.1
## [7] compiler_4.4.1 RSQLite_2.3.7
## [9] GenomicFeatures_1.57.1 png_0.1-8
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.3
## [15] fastmap_1.2.0 backports_1.5.0
## [17] XVector_0.45.0 labeling_0.4.3
## [19] utf8_1.2.4 Rsamtools_2.21.2
## [21] rmarkdown_2.28 markdown_1.13
## [23] UCSC.utils_1.1.0 purrr_1.0.2
## [25] bit_4.5.0 xfun_0.48
## [27] BSgenome.Hsapiens.UCSC.hg38_1.4.5 zlibbioc_1.51.2
## [29] cachem_1.1.0 GenomeInfoDb_1.41.2
## [31] jsonlite_1.8.9 blob_1.2.4
## [33] highr_0.11 DelayedArray_0.33.1
## [35] BiocParallel_1.41.0 parallel_4.4.1
## [37] R6_2.5.1 plyranges_1.25.0
## [39] VariantAnnotation_1.51.2 stringi_1.8.4
## [41] bslib_0.8.0 RColorBrewer_1.1-3
## [43] rtracklayer_1.65.0 DNAcopy_1.81.0
## [45] GenomicRanges_1.57.2 jquerylib_0.1.4
## [47] Rcpp_1.0.13 SummarizedExperiment_1.35.5
## [49] knitr_1.48 zoo_1.8-12
## [51] IRanges_2.39.2 Matrix_1.7-1
## [53] splines_4.4.1 tidyselect_1.2.1
## [55] abind_1.4-8 yaml_2.3.10
## [57] ggtext_0.1.2 codetools_0.2-20
## [59] curl_5.2.3 lattice_0.22-6
## [61] tibble_3.2.1 Biobase_2.67.0
## [63] withr_3.0.2 KEGGREST_1.45.1
## [65] evaluate_1.0.1 survival_3.7-0
## [67] xml2_1.3.6 Biostrings_2.75.0
## [69] pillar_1.9.0 BiocManager_1.30.25
## [71] MatrixGenerics_1.17.1 checkmate_2.3.2
## [73] stats4_4.4.1 generics_0.1.3
## [75] RCurl_1.98-1.16 S4Vectors_0.43.2
## [77] ggplot2_3.5.1 commonmark_1.9.2
## [79] munsell_0.5.1 scales_1.3.0
## [81] glue_1.8.0 changepoint_2.2.4
## [83] maketools_1.3.1 tools_4.4.1
## [85] BiocIO_1.17.0 sys_3.4.3
## [87] data.table_1.16.2 BSgenome_1.75.0
## [89] GenomicAlignments_1.41.0 buildtools_1.0.0
## [91] maftools_2.21.3 XML_3.99-0.17
## [93] grid_4.4.1 tidyr_1.3.1
## [95] rbibutils_2.3 AnnotationDbi_1.69.0
## [97] colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [99] restfulr_0.0.15 cli_3.6.3
## [101] fansi_1.0.6 S4Arrays_1.5.11
## [103] dplyr_1.1.4 gtable_0.3.6
## [105] BSgenome.Hsapiens.UCSC.hg19_1.4.3 sass_0.4.9
## [107] digest_0.6.37 BiocGenerics_0.53.0
## [109] SparseArray_1.5.45 farver_2.1.2
## [111] rjson_0.2.23 memoise_2.0.1
## [113] htmltools_0.5.8.1 lifecycle_1.0.4
## [115] httr_1.4.7 gridtext_0.1.5
## [117] bit64_4.5.2