Title: | Targeted sequencing panel design based on mutation hotspots |
---|---|
Description: | seq.hotSPOT provides a resource for designing effective sequencing panels to help improve mutation capture efficacy for ultradeep sequencing projects. Using SNV datasets, this package designs custom panels for any tissue of interest and identifies the genomic regions likely to contain the most mutations. Establishing efficient targeted sequencing panels can allow researchers to study mutation burden in tissues at high depth without the economic burden of whole-exome or whole-genome sequencing. This tool was developed to make high-depth sequencing panels to study low-frequency clonal mutations in clinically normal and cancerous tissues. |
Authors: | Sydney Grant [aut, cre], Lei Wei [aut], Gyorgy Paragh [aut] |
Maintainer: | Sydney Grant <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.7.0 |
Built: | 2024-12-28 05:40:28 UTC |
Source: | https://github.com/bioc/seq.hotSPOT |
create a dataframe containing the coordinates of all potential amplicons for hotspot testing
amp_pool(data, amp)
data | A dataframe containing the location of each mutation. |
amp | The length of amplicons in number of base pairs. |
This algorithm searches the mutational dataset (input) for mutational hotspot regions on each chromosome:
1. Starting at the mutation with the lowest chromosomal position (primary mutation), using a modified rank and recovery system, the algorithm searches for the closest neighboring mutation.
2. If the neighboring mutation is less than one amplicon length away from the primary mutation, it is included in the hotspot region.
   a. This rank and recovery step is repeated, integrating mutations into the hotspot region until the neighboring mutation is at least one amplicon length away from the primary mutation.
   b. Once a neighboring mutation is one amplicon length or more from the primary mutation, incorporation into the hotspot region halts.
3. Each hotspot that fits within one amplicon length, measured from its lowest to its highest mutation position, is covered by a single amplicon, which is added to the amplicon pool with a unique ID.
   a. The center of each such amplicon is defined by the weighted distribution of mutations.
4. For each hotspot larger than one amplicon, the algorithm examines 5 potential amplicons at each covered mutation in the hotspot:
   a. one amplicon directly upstream of the primary mutation
   b. one amplicon directly downstream of the primary mutation
   c. one amplicon with the mutation at the end of the read, extending (amplicon length - 1) base pairs upstream
   d. one amplicon with the mutation at the beginning of the read, extending (amplicon length - 1) base pairs downstream
   e. one amplicon with the mutation directly in the center.
5. All amplicons generated for each hotspot region of interest are assigned a unique ID and added to the amplicon pool.
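The grouping in steps 1 and 2 can be illustrated with a minimal base R sketch. It is not the package's implementation: `group_hotspots` is a hypothetical helper, it handles one chromosome at a time, and it assumes that distance is measured to the most recently added mutation, which is one reading of the steps above that allows hotspots wider than one amplicon.

```r
# Hypothetical helper (not part of seq.hotSPOT): assign a hotspot ID to each
# mutation position on a single chromosome.
# Assumption: a mutation joins the current hotspot when it lies less than
# `amp` bases from the previously included mutation.
group_hotspots <- function(pos, amp) {
  pos <- sort(unique(pos))
  if (length(pos) == 0) {
    return(data.frame(pos = numeric(0), hotspot = integer(0)))
  }
  # A new hotspot starts at the first mutation and after every gap >= amp
  new_region <- c(TRUE, diff(pos) >= amp)
  data.frame(pos = pos, hotspot = cumsum(new_region))
}

toy_pos <- c(100, 130, 180, 400, 450, 1000)  # made-up positions
hotspots <- group_hotspots(toy_pos, amp = 100)
print(hotspots)
```

With these toy positions the first three mutations share one hotspot, the next two form a second, and the last one stands alone.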
The mutation dataset should include two columns containing the chromosome and genomic position; the columns should be named "chr" and "pos" respectively. Optionally, the gene name for each mutation may be included under a column named "gene".
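As an illustration of the expected input shape, the sketch below builds a small data frame with the documented column names and checks for them before use. The values are made up, the chromosome encoding is an assumption, and `check_mutation_input` is a hypothetical helper rather than a package function.

```r
# Made-up example data in the documented shape: one row per mutation.
# The chromosome encoding ("1", "2", ...) is an assumption.
toy_mutations <- data.frame(
  chr  = c("1", "1", "2"),
  pos  = c(1050, 1075, 20400),
  gene = c("GENE_A", "GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop early if a required column is missing.
check_mutation_input <- function(data, include_genes = FALSE) {
  required <- c("chr", "pos")
  if (include_genes) required <- c(required, "gene")
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

check_mutation_input(toy_mutations, include_genes = TRUE)
```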
A dataframe containing the genomic coordinates of all potential amplicons
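The five candidate placements listed in step 4 of the algorithm can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `candidate_amplicons` is a hypothetical helper, coordinates are treated as inclusive, "upstream" is taken to mean lower coordinates, and the centered amplicon starts `amp %/% 2` bases before the mutation.

```r
# Hypothetical helper: the five candidate amplicons around one mutation at
# position p, each exactly `amp` bases long (inclusive coordinates assumed).
candidate_amplicons <- function(p, amp) {
  starts <- c(
    upstream       = p - amp,         # ends one base before the mutation
    downstream     = p + 1,           # starts one base after the mutation
    mutation_end   = p - (amp - 1),   # mutation is the last base
    mutation_start = p,               # mutation is the first base
    centered       = p - amp %/% 2    # mutation near the middle
  )
  data.frame(
    placement = names(starts),
    start     = unname(starts),
    end       = unname(starts) + amp - 1,
    row.names = NULL
  )
}

candidates <- candidate_amplicons(p = 1000, amp = 100)
print(candidates)
```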
library(seq.hotSPOT)
data("mutation_data")
amp_pool(mutation_data, 100)
create a targeted sequencing panel by finding which amplicons will likely capture the most mutations using a pseudo-exhaustive selection method
com_hotspot(fw_panel, bins, data, amp, len, size, include_genes)
fw_panel | A dataframe containing the sequencing panel designed by fw_hotspot. |
bins | A dataframe containing all potential amplicons. |
data | A dataframe containing the location of each mutation. |
amp | The length of amplicons in number of base pairs. |
len | The total length of the sequencing panel in number of base pairs. |
size | The hotspot size, in number of amplicons, above which a hotspot is split up. |
include_genes | TRUE or FALSE, depending on whether the dataset includes gene names. |
Comprehensive Selection Sequencing Panel Identifier (Optimal mutation capture)
1. To conserve computational power, the forward selection sequencing panel identifier is run first to determine the lowest number of mutations per amplicon (mutation frequency) that needs to be included in a sequencing panel of the predetermined length.
   a. Any amplicon generated by the algorithm that captures fewer mutations than this threshold is removed.
2. To make exhaustive selection of amplicon combinations feasible for hotspot areas longer than the predefined number of amplicons, the algorithm breaks these large regions into multiple smaller regions.
   a. The amplicons covering these regions are pulled from the amplicon pool based on their unique IDs.
3. The algorithm finds the minimum amplicon overlap within the region and all positions with this value, then identifies the longest continuous stretch of positions at that minimum.
   a. The region is split at the center of this longest continuous stretch of minimum values, and splitting continues until every smaller region is shorter than the "n" amplicon length set by the user.
      i. As this set number of amplicons decreases, the computation time required also often decreases.
4. All amplicons contained in these bins are added back to the amplicon pool under a new unique ID.
5. Amplicons covering hotspots less than or equal to one amplicon length are added to the final sequencing panel dataset.
6. To determine the optimal combination of amplicons for each region, the number of amplicons necessary for full coverage of the bin is calculated.
7. A list is generated of every possible combination of the n amplicons needed. For each combination of amplicons:
   a. amplicons that would not meet the threshold of unique mutations are filtered out, and the number of all mutations captured by these amplicons is calculated.
   b. the combination of amplicons that yields the highest number of mutations is added to the final sequencing panel.
8. All amplicons in the final sequencing panel are ranked from highest to lowest based on the number of mutations they cover.
9. All amplicons capturing a number of mutations equal to the cutoff are further ranked to favor amplicons whose mutations lie closer to the center of the amplicon.
10. Cumulative base-pair length and cumulative mutations covered by each amplicon are calculated.
   a. Depending on the desired length of the targeted panel, a cutoff may be applied to remove all amplicons which fall below a set cumulative length.
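The exhaustive search in step 7 can be sketched in base R with `combn`. This is a simplified illustration, not the package's implementation: `count_captured` and `best_combination` are hypothetical helpers, the candidate amplicons and mutation positions are made up, and the threshold filter from step 7a is left out for brevity.

```r
# Hypothetical helper: number of distinct mutation positions that fall inside
# at least one amplicon (inclusive start/end coordinates assumed).
count_captured <- function(amplicons, pos) {
  covered <- vapply(
    pos,
    function(p) any(amplicons$start <= p & p <= amplicons$end),
    logical(1)
  )
  sum(covered)
}

# Hypothetical helper: try every combination of n amplicons, keep the best.
best_combination <- function(amplicons, pos, n) {
  combos <- combn(nrow(amplicons), n, simplify = FALSE)
  scores <- vapply(
    combos,
    function(idx) count_captured(amplicons[idx, , drop = FALSE], pos),
    integer(1)
  )
  winner <- which.max(scores)  # the first combination wins ties
  list(amplicons = amplicons[combos[[winner]], ], captured = scores[winner])
}

# Made-up candidate amplicons (100 bp each) and mutation positions
toy_amplicons <- data.frame(start = c(100, 150, 200, 260))
toy_amplicons$end <- toy_amplicons$start + 99
toy_pos <- c(110, 120, 190, 205, 270, 300, 355)

best <- best_combination(toy_amplicons, toy_pos, n = 2)
print(best)
```

The number of combinations grows quickly with the number of candidates per region, which is consistent with the text's motivation for splitting large hotspots before this step.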
A dataframe containing the genomic coordinates for targeted sequencing panel
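The splitting rule from step 3 of the comprehensive selection (split a large region at the center of its longest stretch of minimum amplicon overlap) can be sketched with `rle`. This is an illustration only: `split_point` is a hypothetical helper, coordinates are assumed inclusive, and the amplicons are made up.

```r
# Hypothetical helper: find the position at which to split a region covered
# by overlapping amplicons.
split_point <- function(starts, ends) {
  region <- seq(min(starts), max(ends))
  # Number of amplicons overlapping each position in the region
  overlap <- vapply(region, function(p) sum(starts <= p & p <= ends), integer(1))
  # Runs of consecutive positions that are (or are not) at the minimum overlap
  runs <- rle(overlap == min(overlap))
  run_end <- cumsum(runs$lengths)
  run_start <- run_end - runs$lengths + 1
  is_min <- which(runs$values)
  longest <- is_min[which.max(runs$lengths[is_min])]
  # Center of the longest minimum-overlap run, rounded down
  region[(run_start[longest] + run_end[longest]) %/% 2]
}

toy_starts <- c(100, 120, 200, 280, 300)  # made-up 100 bp amplicons
toy_ends <- toy_starts + 99
cut_at <- split_point(toy_starts, toy_ends)
print(cut_at)
```

In this toy region the minimum overlap is 1, its longest stretch runs from position 220 to 279, and the split lands at 249.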
library(seq.hotSPOT)
data("mutation_data")
my_bins <- amp_pool(mutation_data, 100)
my_fw_panel <- fw_hotspot(my_bins, mutation_data, 100, 1000, TRUE)
com_hotspot(my_fw_panel, my_bins, mutation_data, 100, 1000, 3, TRUE)
create a targeted sequencing panel by finding which amplicons will likely capture the most mutations
fw_hotspot(bins, data, amp, len, include_genes)
bins | A dataframe containing all potential amplicons. |
data | A dataframe containing the location of each mutation. |
amp | The length of amplicons in number of base pairs. |
len | The total length of the sequencing panel in number of base pairs. |
include_genes | TRUE or FALSE, depending on whether the dataset includes gene names. |
Forward Selection Sequencing Panel Identifier
1. Amplicons covering hotspots less than or equal to one amplicon in length are added to the final sequencing panel dataset.
2. For amplicons covering larger hotspot regions, the algorithm uses a forward selection method to determine the optimal combination of amplicons to use in the sequencing panel:
   a. the algorithm first identifies the amplicon containing the highest number of mutations
   b. the algorithm then identifies the next amplicon, which contains the highest number of new mutations
   c. this process continues until all mutations are covered by at least one amplicon.
3. Each of these amplicons is then added to the final sequencing panel with its own unique ID.
4. All amplicons in the final sequencing panel are ranked from highest to lowest based on the number of mutations they cover.
5. The algorithm then calculates the cumulative base-pair length and the cumulative mutations covered by each amplicon.
6. Depending on the desired length of the targeted panel, a cutoff may be applied to remove all amplicons which fall below a set cumulative length.
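The forward selection in step 2 is a greedy procedure, and a minimal base R sketch of that idea follows. It is not the package's code: `forward_select` is a hypothetical helper, the amplicons and mutation positions are made up, and ties are broken by taking the first candidate.

```r
# Hypothetical helper: greedily pick the amplicon that captures the most
# not-yet-covered mutations until every coverable mutation is captured.
forward_select <- function(amplicons, pos) {
  # covers[i, j] is TRUE when amplicon i contains mutation j
  covers <- outer(
    seq_len(nrow(amplicons)), seq_along(pos),
    function(i, j) amplicons$start[i] <= pos[j] & pos[j] <= amplicons$end[i]
  )
  uncovered <- rep(TRUE, length(pos))
  chosen <- integer(0)
  new_mutations <- numeric(0)
  while (any(uncovered)) {
    gain <- rowSums(covers[, uncovered, drop = FALSE])
    best <- which.max(gain)
    if (gain[best] == 0) break  # no candidate reaches the remaining mutations
    chosen <- c(chosen, best)
    new_mutations <- c(new_mutations, gain[best])
    uncovered <- uncovered & !covers[best, ]
  }
  cbind(amplicons[chosen, ], new_mutations = new_mutations)
}

# Made-up candidate amplicons (100 bp each) and mutation positions
toy_amplicons <- data.frame(start = c(100, 150, 200, 260))
toy_amplicons$end <- toy_amplicons$start + 99
toy_pos <- c(110, 120, 190, 205, 270, 300, 355)

fw_sketch <- forward_select(toy_amplicons, toy_pos)
print(fw_sketch)
```

Greedy selection is cheap but is not guaranteed to find the best combination, which is presumably why the package also offers the more exhaustive `com_hotspot`.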
A dataframe containing the genomic coordinates for targeted sequencing panel
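The ranking and cutoff in steps 4 to 6 can be sketched as below. `trim_panel` and the column names `id`, `mutations`, `cum_bp` and `cum_mutations` are hypothetical, every amplicon is assumed to be exactly `amp` bases long, and the cutoff is interpreted as keeping amplicons whose cumulative length does not exceed `len`.

```r
# Hypothetical helper: rank amplicons by mutations covered, accumulate length
# and mutation counts, and keep those that fit within the panel length.
trim_panel <- function(panel, amp, len) {
  ranked <- panel[order(panel$mutations, decreasing = TRUE), ]
  ranked$cum_bp <- amp * seq_len(nrow(ranked))      # cumulative base pairs
  ranked$cum_mutations <- cumsum(ranked$mutations)  # cumulative mutations
  ranked[ranked$cum_bp <= len, ]
}

# Made-up panel: five amplicons with the number of mutations each covers
toy_panel <- data.frame(id = 1:5, mutations = c(2, 5, 1, 3, 4))
trimmed <- trim_panel(toy_panel, amp = 100, len = 300)
print(trimmed)
```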
library(seq.hotSPOT)
data("mutation_data")
my_bins <- amp_pool(mutation_data, 100)
fw_hotspot(my_bins, mutation_data, 100, 1000, TRUE)
A dataframe containing the chromosome and base-pair position of single nucleotide variants from ultradeep sequencing of epidermis.
mutation_data
'mutation_data' is a dataframe with 3 columns and 201 rows:
chr | Chromosome on which the mutation is located. |
pos | Base pair position of the mutation. |
gene | Name of the gene affected by the mutation. |
mutation_data
A dataframe containing the chromosome number, base pair location and optional gene name of mutations
Wei L, Christensen SR, Fitzgerald ME, Graham J, Hutson ND, Zhang C, Huang Z, Hu Q, Zhan F, Xie J, Zhang J, Liu S, Remenyik E, Gellen E, Colegio OR, Bax M, Xu J, Lin H, Huss WJ, Foster BA, Paragh G. Ultradeep sequencing differentiates patterns of skin clonal mutations associated with sun-exposure status and skin cancer burden. Sci Adv. 2021 Jan 1;7(1):eabd7703. doi: 10.1126/sciadv.abd7703. PMID: 33523857; PMCID: PMC7775785.
seq.hotSPOT provides a resource for designing effective sequencing panels to help improve mutation capture efficacy for ultradeep sequencing projects. Using SNV datasets, this package designs custom panels for any tissue of interest and identifies the genomic regions likely to contain the most mutations. Establishing efficient targeted sequencing panels can allow researchers to study mutation burden in tissues at high depth without the economic burden of whole-exome or whole-genome sequencing. This tool was developed to make high-depth sequencing panels to study low-frequency clonal mutations in clinically normal and cancerous tissues.
A package containing functions which generate optimal targeted sequencing panels based on mutation hotspots.