---
title: "Introduction to G4SNVHunter"
author:
- name: Rongxin Zhang
affiliation:
- "State Key Laboratory of Digital Medical Engineering,
Southeast University, China"
- "Laboratoire d’Optique et Biosciences (LOB), Ecole Polytechnique, France"
email: rongxinzhang@outlook.com
package: G4SNVHunter
date: "`r Sys.Date()`"
output:
BiocStyle::html_document
bibliography: references.bib
vignette: >
%\VignetteIndexEntry{Introduction to G4SNVHunter}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
G-quadruplexes (G4s) are non-canonical secondary structures that can form in
guanine-rich DNA or RNA regions [@varshney2020regulation]. These structures are
significantly enriched in functional regions like gene promoters
[@huppert2007g] and telomeres [@phan2002human], and their regulatory roles are
closely linked to their structures rather than their sequences [@zhang2024g].
Single nucleotide variants (SNVs) are a common type of genetic variation. If
such a variant occurs within a G4 region, it may affect the formation of that
G4 structure.
`G4SNVHunter` is designed to evaluate the potential impact of SNVs on G4
formation, but it can also be used to assess single nucleotide polymorphisms
(SNPs). This package leverages the highly effective G4 structure prediction
algorithm, G4Hunter [@bedrat2016re], which was conceptualized by Mergny et al.
[@bedrat2016re]. One of G4Hunter's notable advantages is that it will evaluate
the propensity of G4 structures within a specified window, effectively taking
into account the influence of flanking sequences on G4 formation, not just the
G4 sequence itself.
`G4SNVHunter` only requires users to provide genomic sequence data
(`DNAStringSet` object) and substitution information of variants
(`GRanges` object) to quickly determine which G4 structures may be affected by
SNVs (or SNPs). We simplify the analysis process by integrating complex
functionality into a simple function that can be used even by those
inexperienced in G4 research or basic programming skills. Users can
subsequently design *‘wet’* experiments based on the results of `G4SNVHunter`
to further explore the mechanisms by which variants affect biological
regulatory functions through their impact on G4 structures.
# Installation
As the first step, you need to install `G4SNVHunter`, which can be done with a
simple command,
please ensure that your network connection is stable.
```{r install_G4SNVHunter, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Currently, G4SNVHunter depends on the devel version of Bioconductor.
# Note: Bioconductor devel requires the latest R version (4.5 or higher).
# Or, you can install G4SNVHunter from GitHub instead.
BiocManager::install(version='devel')
# Install the G4SNVHunter package and its dependencies for this tutorial.
BiocManager::install("G4SNVHunter", dependencies = TRUE)
```
Then load `G4SNVHunter` to access its functions.
```{r load_G4SNVHunter, message = FALSE}
library(G4SNVHunter)
```
# Load packages required for this tutorial
During this tutorial, we might need to use a few additional packages.
Since we specified `dependencies = TRUE` when installing `G4SNVHunter` package,
these additional packages have already been installed.
We can load them directly.
```{r load_required_pkg, message = FALSE, warning = FALSE}
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicRanges)
library(DT)
library(rtracklayer)
```
# Input data
`G4SNVHunter` requires you to provide two types of data:
* the genomic sequences (a `DNAStringSet` object)
* the set of SNVs or SNPs (a `GRanges` object)
While the input formats for these data are quite flexible, they must ultimately
be converted into the appropriate formats: `DNAStringSet` for sequences and
`GRanges` for SNVs.
Detailed descriptions on how to prepare and format these inputs are provided
below.
## Genomic sequences
The genomic sequences refer to the sequence of the chromosome or fragment where
your SNVs (or SNPs) are located. They can be complete chromosome sequences or
large fragments extracted from chromosomes.
We will predict G4 structures from the sequences you provide and analyze
whether these SNVs can affect their formation.
Please note that the coordinates of your SNVs must be relative to the genomic
sequence you provide, and they should be in `1-based` coordinates.
The genomic sequences need to be processed into a `DNAStringSet` object. We
provide a built-in function, `loadSequence`, to facilitate this processing. Of
course, you can also load your customized sequence file and convert it into
a `DNAStringSet` object without using `loadSequence` function.
`loadSequence` mainly accepts three types of input:
* A two-column `data.frame`, with the first column containing the sequence
identifiers and the second column containing their corresponding sequences.
This should be specified using the `genome_seq` parameter.
* The path to a stored FASTA file. The FASTA file must have a `.fa`, `.fna`,
or `.fasta` extension. This should be specified using the `seq_path`
parameter.
* A text file (`.txt`) that stores the sequence identifiers and their
corresponding sequences. The first column should contain the sequence
identifiers, and the second column should contain the sequences. This should
also be specified using the `seq_path` parameter.
Note: Please do not include column names in the file!
Here are some examples for you to load sequences into a `DNAStringSet` object
using `loadSequence` function:
Load from a `data.frame` object
```{r load_seq_from_df_using_loadSequence}
seq_df <- data.frame(chr = c("seq1", "seq2"),
sequence = c(paste0(rep("G", 100), collapse = ""),
paste0(rep("A", 100), collapse = "")))
seq <- loadSequence(genome_seq = seq_df)
```
Load from a `fasta` file
```{r load_seq_from_fa_using_loadSequence}
# File path to the sequences in fasta format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
seq <- loadSequence(seq_path = fa_path)
```
Load from a `txt` file
```{r load_seq_from_txt_using_loadSequence}
# File path to the sequences in txt format
txt_path <- system.file("extdata", "seq.txt", package = "G4SNVHunter")
seq <- loadSequence(seq_path = txt_path)
```
We can also obtain genome sequences for some common species from
[Bioconductor Annotation Packages](
https://www.bioconductor.org/packages/release/data/annotation/).
While this is convenient, it requires you to install some related packages in
advance.
For example:
```{r install_hg19_refseq, message = FALSE, warning = FALSE}
# Load sequence for chromosome 21 (hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19
chr21_seq <- DNAStringSet(hg19$chr21)
# Chromosome names are needed for analysis
names(chr21_seq) <- "chr21"
```
## SNV data
The SNV (or SNP) data needs to be processed into a `GRanges` object. Since the
form of SNV data is too flexible, we do not provide a function to load SNVs
here. However, you can easily load and convert SNV data into a `GRanges` object
yourself.
For example,
```{r load_SNV, message = FALSE}
# Path to SNPs
snp_path <- system.file("extdata", "snp.txt", package = "G4SNVHunter")
# Load SNPs into memory
snp <- read.table(snp_path, sep = "\t", header = FALSE)
# Convert snp to GRanges
snp <- GRanges(seqnames = snp$V1,
ranges = IRanges(start = snp$V2, width = 1),
rsid = snp$V3,
ref = snp$V4,
alt = snp$V5)
print(snp)
```
## SNV data validation
We provide a built-in `checkSNV` function for you to verify whether your SNV
(or SNP) data meets the requirements of our program.
However, it is not necessary to run this function explicitly, as our program
will automatically perform the checks at the appropriate stages. Nevertheless,
performing a proactive check is always a good practice.
```{r check_SNVs}
gr <- GRanges(
seqnames = Rle("seq1"),
ranges = IRanges(c(100, 200, 300), width=1),
ref = c("A", "C", "G"),
alt = c("T", "T", "A")
)
# Check width ('w'), ref ('r'), and alt ('a')
# Returns TRUE if it passed the validation
checkSNV(gr, mode = "wra", ref_col = "ref", alt_col = "alt")
```
# Predict G4s
## Leverage G4Hunter for G4 detection
You can use the `G4HunterDetect` function to directly predict G4s from a
`DNAStringSet` object containing sequences based on G4Hunter algorithm.
For example,
```{r predict_G4s}
# Sequence file in fasta file format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
# Load sequences
seq <- loadSequence(seq_path = fa_path)
# Predict G4s
G4_detected <- G4HunterDetect(seq)
```
Then, we can examine the prediction results, which are stored as a `GRanges`
object.
You can use functions like `print` to directly view the results. However, in
this instance, we'll leverage the [`datatable`](https://rstudio.github.io/DT/)
function to view the predicted G4s, as this method provides a more
user-friendly display.
Let's take a look at the G4s,
```{r print_G4s, warning = FALSE, message = FALSE}
datatable(as.data.frame(G4_detected), options = list(scrollX = TRUE))
```
This function will return a `GRanges` object containing all potential G4s
predicted by the G4Hunter algorithms. It mainly includes:
* The sequence identifier where the G4 is located (`seqnames`),
* The position of the G4 relative to the sequence you provided
(`ranges`, `1-based` coordinate),
* The strand on which the G4 is located (`strand`), with `+` indicating the G4
is on the positive strand, and `-` indicating it is on the negative strand,
* The G4Hunter score for the G4 (`score`),
* The maximum G4Hunter score of the windows covering that G4 during the
execution of the G4Hunter algorithm
(`max_score`, used to determine if the G4 surpasses the threshold),
* The predicted G4 sequence on the positive strand (`sequence`, with C-rich
means G4s on the reverse strand).
Please note that `score` reflects the precise score of the G4 sequence, while
`max_score` represents the highest score obtained by sliding a window across
that G4.
See the illustration below.
```{r G4_score_maxscore, echo=FALSE}
knitr::include_graphics("images/G4SNVHunter_01.svg")
```
Some parameters for prediction can be customized.
For example,
```{r predict_G4s_parameters}
# Predict G4 by customizing other parameters
G4_detected <- G4HunterDetect(seq, threshold = 1.5, window_size = 20)
```
* `threshold`: G4Hunter will search for G4s in windows above this threshold
(absolute value). Default is `1.5`. Unless there are special needs, we do not
recommend setting the threshold below `1.2`.
* `window_size`: The window size (bp) for G4Hunter prediction.
Default is `25`.
Another commonly used window size is `20`. However, `25` is generally
preferred unless otherwise required.
* `include_sequences`: Whether to include the predicted G4 sequences in the
output. Default is `TRUE`.
* `strands`: Indicate which strand requires prediction, with `b` for both
strands and `p` for positive strand only. Please note that if your genome is
single-stranded, this parameter should be set to `p` to prevent the program
from predicting G4s on a non-existent negative strand.
We generally do not recommend modifying certain parameters, such as
`window_size` and `threshold`, as their default settings are already optimal.
## Export prediction results
Since we store the prediction results as a `GRanges` object, exporting is very
straightforward.
For example, you can export it into a CSV file,
```{r export_G4_as_csv, eval = FALSE}
# export as csv format
write.csv(as.data.frame(G4_detected), "/path/to/your/file.csv")
```
or export it into other formats as well.
```{r export_G4_as_other_formats, eval = FALSE}
# export as bed format
export(G4_detected, "/path/to/your/file.bed", format = "bed")
# export as bigwig format
export(G4_detected, "/path/to/your/file.bw", format = "bigWig")
```
## Visualize G4Hunter prediction results
You can use `plotG4Info` function for visualizing the statistical information
of G4s predicted by the `G4HunterDetect` function.
```{r plot_G4, message = FALSE, fig.width = 8.5, fig.height = 8}
plotG4Info(G4_detected)
```
The left panels (`A`, `C`, `E`) show the overall distribution of the
`max_score` (absolute value), `score` (absolute value), and length of G4s,
respectively. The right panels (`B`, `D`, `F`) illustrate the distribution of
`max_score`, `score`, and length for G4 structures on both the positive and
negative strands. The scores for G4s on the positive strand are positive, while
those on the negative strand are negative, resulting in a bimodal distribution
of `max_score` and `score`. Please note that the positive or negative score
reflects only the type of strand the G4 is on, whereas the capability of the G4
structure to form is determined by the absolute value of the score.
# Evaluate SNV effects
We provide two modes in `SNVImpactG4` function for you to assess the potential
impact of SNVs on G4s:
* Variant-centric assessment (`s` mode)
* Sample-centric assessment (`m` mode)
In short, variant-centric assessment considers the impact of each SNV on the G4
sequence individually, while sample-centric assessment combines the effects of
multiple SNVs on a particular G4 for each sample.
See the illustration below.
```{r SNV_effect_interpretation, echo = FALSE}
knitr::include_graphics("images/G4SNVHunter_02.svg")
```
We have prepared example data with variants from chromosome 21 of the human
genome (hg19), and you can easily load them,
```{r load_example_gr_data}
data(snp_gr)
data(snv_gr)
```
Additionally, you can quickly and conveniently predict G4 sequences on
chromosome 21 using the `G4HunterDetect` function.
```{r predict_chr21_G4s}
# Predict the G4s in human chr 21 (hg19)
chr21_G4 <- G4HunterDetect(chr21_seq)
```
## Variant-centric assessment
The default mode of `SNVImpactG4` is variant-centric (`mode = 's'`), which
requires only the G4 data in `GRanges` format (as returned by the
`G4HunterDetect` function), the SNV data in `GRanges` format, and the column
name for the alternate bases.
Then, the assessment can be done easily by,
```{r SNV_effects_s_mode, message = FALSE}
snp_eff <- SNVImpactG4(chr21_G4,
snp_gr,
alt_col = "alt")
```
Let's view the first three records,
```{r SNV_effects_s_mode_header3}
datatable(as.data.frame(snp_eff[1:3]), options = list(scrollX = TRUE))
```
`SNVImpactG4` function will return a `GRanges` object that includes detailed
information about the SNVs (core columns and `SNV.info.*` columns), their
associated G4 regions (`G4.info.*` columns), and the changes in G4 sequences
(`mut.G4.seq` and `mut.G4.anno.seq` columns) and G4Hunter scores (`mut.score`
and `score.diff` column).
Please note that SNVs located outside of G4 regions will be automatically
ignored.
## Sample-centric assessment
You can set the `mode` parameter in `SNVImpactG4` function to `'m'` to enable
sample-centric mode. This mode is particularly useful for specific scenarios,
such as analyzing SNVs derived from cancer patients.
When using this mode, you must also specify the column names for sample IDs
(`sampleid_col`) and SNV IDs (`snvid_col`).
```{r SNV_effects_m_mode, message = FALSE}
# Column names of the Sample ID and SNV ID must be specified
snv_eff <- SNVImpactG4(chr21_G4,
snv_gr,
alt_col = "alt",
mode = "m",
sampleid_col = "sampleid",
snvid_col = "snv_id")
```
Let’s view the first three records,
```{r SNV_effects_m_mode_header3}
datatable(as.data.frame(snv_eff[1:3]), options = list(scrollX = TRUE))
```
It will return a `GRanges` object containing detailed information about the G4s
(core columns and `G4.info.*` columns). The object will also include the IDs of
the SNVs (`snv.ids` column) located in the G4 regions, the sample IDs
(`sample.ids` column), the mutated G4 sequences (`mut.G4.seq` and
`mut.G4.anno.seq` columns), and the changes in G4Hunter scores (`mut.score` and
`score.diff` columns).
Please note that in sample-centric mode, when multiple SNVs from the same
sample occur within a single G4 region, we will consider the combined impact of
these SNVs. For example, the following G4 overlaps with two SNVs
(`id_3608` and `id_49857`) from sample `GQ3Y`, so the final G4Hunter score is
based on the combined effect of these two SNVs.
```{r SNV_effects_m_mode_example}
datatable(as.data.frame(snv_eff[528]), options = list(scrollX = TRUE))
```
## Filtering potentially affected G4 sequences
Given that some SNVs may have minimal effects on G4 formation, we need to
filter out those SNVs that are worth investigating further for experimental
validation or additional analysis. In other words, identifying those variants
that have the potential to disrupt the G4 structures.
We provide a convenient function, `filterSNVImpact`, to help with this
filtering process.
There are three threshold parameters for you to adjust: `raw_score_threshold`,
`mut_score_threshold`, and `score_diff_threshold`.
If `raw_score_threshold` (a positive number, but no more than 4) is specified,
`filterSNVImpact` will filter out records where the absolute value of the
original G4Hunter score is below this threshold.
If `mut_score_threshold` (a positive number, but no more than 4) is specified,
`filterSNVImpact` will retain records where the G4Hunter score of the mutated
G4 sequences does not exceed this threshold.
If `score_diff_threshold` (a negative number, but no less than -4) is
specified, `filterSNVImpact` will retain records where the decrease in the
G4Hunter score after mutation exceeds this threshold.
For example, if `score_diff_threshold = -0.2`, those records will be retained:
\( \left| \text{G4HunterScore}_{\text{mut_seq}} \right| - \left|
\text{G4HunterScore}_{\text{raw_seq}} \right| \leq -0.2 \)
Our recommendation is that `raw_score_threshold` should be greater than 1.5,
and the `mut_score_threshold` should be less than 1.2. This is an empirically
based guideline, and you are certainly free to adjust them to be more stringent
or to use these parameters for flexible customization as needed.
Please note that you must specify at least one threshold parameter, but you do
not need to specify all of them.
For example, using the results returned in mode `m`
```{r filter_SNV_impact}
filtered_snv_eff <- filterSNVImpact(snv_eff,
mut_score_threshold = 1.2,
score_diff_threshold = -0.35)
```
We can examine the filtered records to identify SNVs that have a substantial
impact on G4 formation.
```{r view_SNV_significant_impact}
datatable(as.data.frame(filtered_snv_eff), options = list(scrollX = TRUE))
```
The final results can be saved to a file similarly to how G4 prediction results
are saved, as described in [Section 5.2](#export-prediction-results).
## Visualize changes in G4Hunter scores
You can use the `plotSNVImpact` function we provide to visualize changes in
G4Hunter scores. This function is suitable for both variant-centric and
sample-centric output data, as well as their filtered outputs.
We can observe that mutated G4 structures generally exhibit a trend of
decreased formation potential.
```{r eff_plot, fig.height = 3, message = FALSE}
plotSNVImpact(snv_eff)
```
Let’s `examine` how the G4Hunter scores have changed in the `filtered_snv_eff`
object.
We observe that the scores for the filtered G4s have decreased significantly.
```{r eff_plot2, fig.height = 3, message = FALSE}
plotSNVImpact(filtered_snv_eff)
```
## Plot the seqlogo for mutated G4 sequences
Finally, we can use `plotImpactSeq` to visualize the G4 sequences affected by
SNVs.
For example, we can plot the sequence logos for the top five G4s with the most
significant changes in their G4Hunter scores.
```{r seqlogo_plot, fig.height = 8, fig.width = 10, message = FALSE}
top5_snv_eff <- filtered_snv_eff[order(filtered_snv_eff$score.diff)[1:5]]
plotImpactSeq(top5_snv_eff, ncol = 2)
```
# A coherent example
Let's explore a coherent example to predict the potential impact of SNVs on G4
formation.
Suppose you already have the sequences saved as a `DNAStringset` object
(`chr21_seq`) and the SNVs saved as a `GRanges` object (`snv_gr`), then execute
the following code,
```{r coherent_example, eval = FALSE}
# First step, predict G4s
chr21_G4 <- G4HunterDetect(chr21_seq)
# Second step, evaluate the impact of SNVs on G4s in 's' mode
snv_eff <- SNVImpactG4(chr21_G4, snv_gr, alt_col = "alt")
# Filter the results under default parameters
filtered_snv_eff <- filterSNVImpact(snv_eff, mut_score_threshold = 1.2)
# export as csv format
write.csv(as.data.frame(filtered_snv_eff), "/path/to/your/file.csv")
```
# Acknowledgements
The author would like to thank Dr. [Wenyong Zhu](
https://orcid.org/0009-0001-0707-135X) and Dr. [Xiao Sun](
https://orcid.org/0000-0003-1048-7775) from Southeast University for their
contributions: Dr. Zhu for providing the illustrations and testing the package,
and Dr. Sun for reviewing this Vignettes.
# Session Info
```{r session_info}
sessionInfo()
```
# References