G-quadruplexes (G4s) are non-canonical secondary structures that can form in guanine-rich DNA or RNA regions (Varshney et al. 2020). These structures are significantly enriched in functional regions like gene promoters (Huppert and Balasubramanian 2007) and telomeres (Phan and Mergny 2002), and their regulatory roles are closely linked to their structures rather than their sequences (Zhang et al. 2024).
Single nucleotide variants (SNVs) are a common type of genetic variation. If such a variant occurs within a G4 region, it may affect the formation of that G4 structure.
G4SNVHunter
is designed to evaluate the potential impact
of SNVs on G4 formation, but it can also be used to assess single
nucleotide polymorphisms (SNPs). This package leverages the highly
effective G4 structure prediction algorithm, G4Hunter (Bedrat, Lacroix, and Mergny 2016), which was
conceptualized by Mergny et al. (Bedrat,
Lacroix, and Mergny 2016). One of G4Hunter’s notable advantages
is that it will evaluate the propensity of G4 structures within a
specified window, effectively taking into account the influence of
flanking sequences on G4 formation, not just the G4 sequence itself.
G4SNVHunter
only requires users to provide genomic
sequence data (DNAStringSet
object) and substitution
information of variants (GRanges
object) to quickly
determine which G4 structures may be affected by SNVs (or SNPs). We
simplify the analysis process by integrating complex functionality into
a simple function that can be used even by those inexperienced in G4
research or basic programming skills. Users can subsequently design
‘wet’ experiments based on the results of
G4SNVHunter
to further explore the mechanisms by which
variants affect biological regulatory functions through their impact on
G4 structures.
As the first step, you need to install G4SNVHunter
,
which can be done with a simple command, please ensure that your network
connection is stable.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# Currently, G4SNVHunter depends on the devel version of Bioconductor.
# Note: Bioconductor devel requires the latest R version (4.5 or higher).
# Or, you can install G4SNVHunter from GitHub instead.
BiocManager::install(version='devel')
# Install the G4SNVHunter package and its dependencies for this tutorial.
BiocManager::install("G4SNVHunter", dependencies = TRUE)
Then load G4SNVHunter
to access its functions.
During this tutorial, we might need to use a few additional packages.
Since we specified dependencies = TRUE
when installing
G4SNVHunter
package, these additional packages have already
been installed.
We can load them directly.
G4SNVHunter
requires you to provide two types of
data:
the genomic sequences (a DNAStringSet
object)
the set of SNVs or SNPs (a GRanges
object)
While the input formats for these data are quite flexible, they must
ultimately be converted into the appropriate formats:
DNAStringSet
for sequences and GRanges
for
SNVs.
Detailed descriptions on how to prepare and format these inputs are provided below.
The genomic sequences refer to the sequence of the chromosome or fragment where your SNVs (or SNPs) are located. They can be complete chromosome sequences or large fragments extracted from chromosomes. We will predict G4 structures from the sequences you provide and analyze whether these SNVs can affect their formation.
Please note that the coordinates of your SNVs must be relative to the
genomic sequence you provide, and they should be in 1-based
coordinates.
The genomic sequences need to be processed into a
DNAStringSet
object. We provide a built-in function,
loadSequence
, to facilitate this processing. Of course, you
can also load your customized sequence file and convert it into a
DNAStringSet
object without using loadSequence
function.
loadSequence
mainly accepts three types of input:
A two-column data.frame
, with the first column
containing the sequence identifiers and the second column containing
their corresponding sequences. This should be specified using the
genome_seq
parameter.
The path to a stored FASTA file. The FASTA file must have a
.fa
, .fna
, or .fasta
extension.
This should be specified using the seq_path
parameter.
A text file (.txt
) that stores the sequence
identifiers and their corresponding sequences. The first column should
contain the sequence identifiers, and the second column should contain
the sequences. This should also be specified using the
seq_path
parameter.
Note: Please do not include column names in the file!
Here are some examples for you to load sequences into a
DNAStringSet
object using loadSequence
function:
Load from a data.frame
object
seq_df <- data.frame(chr = c("seq1", "seq2"),
sequence = c(paste0(rep("G", 100), collapse = ""),
paste0(rep("A", 100), collapse = "")))
seq <- loadSequence(genome_seq = seq_df)
Load from a fasta
file
# File path to the sequences in fasta format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
seq <- loadSequence(seq_path = fa_path)
Load from a txt
file
# File path to the sequences in txt format
txt_path <- system.file("extdata", "seq.txt", package = "G4SNVHunter")
seq <- loadSequence(seq_path = txt_path)
We can also obtain genome sequences for some common species from Bioconductor Annotation Packages. While this is convenient, it requires you to install some related packages in advance. For example:
The SNV (or SNP) data needs to be processed into a
GRanges
object. Since the form of SNV data is too flexible,
we do not provide a function to load SNVs here. However, you can easily
load and convert SNV data into a GRanges
object yourself.
For example,
# Path to SNPs
snp_path <- system.file("extdata", "snp.txt", package = "G4SNVHunter")
# Load SNPs into memory
snp <- read.table(snp_path, sep = "\t", header = FALSE)
# Convert snp to GRanges
snp <- GRanges(seqnames = snp$V1,
ranges = IRanges(start = snp$V2, width = 1),
rsid = snp$V3,
ref = snp$V4,
alt = snp$V5)
print(snp)
## GRanges object with 15 ranges and 3 metadata columns:
## seqnames ranges strand | rsid ref alt
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] chr21 15397335 * | rs1244782663 G A
## [2] chr21 39931215 * | rs910271214 G A
## [3] chr21 22915361 * | rs1210166884 A G
## [4] chr21 42644592 * | rs554035836 A G
## [5] chr21 10743898 * | rs955290928 A G
## ... ... ... ... . ... ... ...
## [11] chr21 30836340 * | rs1212424980 T C
## [12] chr21 36426864 * | rs1478442431 G T
## [13] chr21 27584238 * | rs1191559536 T C
## [14] chr21 36106373 * | rs561521189 C T
## [15] chr21 21840835 * | rs1301205088 T C
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We provide a built-in checkSNV
function for you to
verify whether your SNV (or SNP) data meets the requirements of our
program.
However, it is not necessary to run this function explicitly, as our program will automatically perform the checks at the appropriate stages. Nevertheless, performing a proactive check is always a good practice.
gr <- GRanges(
seqnames = Rle("seq1"),
ranges = IRanges(c(100, 200, 300), width=1),
ref = c("A", "C", "G"),
alt = c("T", "T", "A")
)
# Check width ('w'), ref ('r'), and alt ('a')
# Returns TRUE if it passed the validation
checkSNV(gr, mode = "wra", ref_col = "ref", alt_col = "alt")
## [1] TRUE
You can use the G4HunterDetect
function to directly
predict G4s from a DNAStringSet
object containing sequences
based on G4Hunter algorithm.
For example,
# Sequence file in fasta file format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
# Load sequences
seq <- loadSequence(seq_path = fa_path)
# Predict G4s
G4_detected <- G4HunterDetect(seq)
## Start processing...
## Merging predicted G4s...
## Done!
Then, we can examine the prediction results, which are stored as a
GRanges
object.
You can use functions like print
to directly view the
results. However, in this instance, we’ll leverage the datatable
function
to view the predicted G4s, as this method provides a more user-friendly
display.
Let’s take a look at the G4s,
This function will return a GRanges
object containing
all potential G4s predicted by the G4Hunter algorithms. It mainly
includes:
The sequence identifier where the G4 is located
(seqnames
),
The position of the G4 relative to the sequence you provided
(ranges
, 1-based
coordinate),
The strand on which the G4 is located (strand
), with
+
indicating the G4 is on the positive strand, and
-
indicating it is on the negative strand,
The G4Hunter score for the G4 (score
),
The maximum G4Hunter score of the windows covering that G4 during
the execution of the G4Hunter algorithm (max_score
, used to
determine if the G4 surpasses the threshold),
The predicted G4 sequence on the positive strand
(sequence
, with C-rich means G4s on the reverse
strand).
Please note that score
reflects the precise score of the
G4 sequence, while max_score
represents the highest score
obtained by sliding a window across that G4.
See the illustration below.
Some parameters for prediction can be customized. For example,
# Predict G4 by customizing other parameters
G4_detected <- G4HunterDetect(seq, threshold = 1.5, window_size = 20)
## Start processing...
## Merging predicted G4s...
## Done!
threshold
: G4Hunter will search for G4s in windows
above this threshold (absolute value). Default is 1.5
.
Unless there are special needs, we do not recommend setting the
threshold below 1.2
.
window_size
: The window size (bp) for G4Hunter
prediction. Default is 25
. Another commonly used window
size is 20
. However, 25
is generally preferred
unless otherwise required.
include_sequences
: Whether to include the predicted
G4 sequences in the output. Default is TRUE
.
strands
: Indicate which strand requires prediction,
with b
for both strands and p
for positive
strand only. Please note that if your genome is single-stranded, this
parameter should be set to p
to prevent the program from
predicting G4s on a non-existent negative strand.
We generally do not recommend modifying certain parameters, such as
window_size
and threshold
, as their default
settings are already optimal.
Since we store the prediction results as a GRanges
object, exporting is very straightforward.
For example, you can export it into a CSV file,
or export it into other formats as well.
You can use plotG4Info
function for visualizing the
statistical information of G4s predicted by the
G4HunterDetect
function.
The left panels (A
, C
, E
) show
the overall distribution of the max_score
(absolute value),
score
(absolute value), and length of G4s, respectively.
The right panels (B
, D
, F
)
illustrate the distribution of max_score
,
score
, and length for G4 structures on both the positive
and negative strands. The scores for G4s on the positive strand are
positive, while those on the negative strand are negative, resulting in
a bimodal distribution of max_score
and score
.
Please note that the positive or negative score reflects only the type
of strand the G4 is on, whereas the capability of the G4 structure to
form is determined by the absolute value of the score.
We provide two modes in SNVImpactG4
function for you to
assess the potential impact of SNVs on G4s:
Variant-centric assessment (s
mode)
Sample-centric assessment (m
mode)
In short, variant-centric assessment considers the impact of each SNV on the G4 sequence individually, while sample-centric assessment combines the effects of multiple SNVs on a particular G4 for each sample.
See the illustration below.
We have prepared example data with variants from chromosome 21 of the human genome (hg19), and you can easily load them,
Additionally, you can quickly and conveniently predict G4 sequences
on chromosome 21 using the G4HunterDetect
function.
## Start processing...
## Now process: chr21.
## Merging predicted G4s...
## Done!
The default mode of SNVImpactG4
is variant-centric
(mode = 's'
), which requires only the G4 data in
GRanges
format (as returned by the
G4HunterDetect
function), the SNV data in
GRanges
format, and the column name for the alternate
bases.
Then, the assessment can be done easily by,
Let’s view the first three records,
SNVImpactG4
function will return a GRanges
object that includes detailed information about the SNVs (core columns
and SNV.info.*
columns), their associated G4 regions
(G4.info.*
columns), and the changes in G4 sequences
(mut.G4.seq
and mut.G4.anno.seq
columns) and
G4Hunter scores (mut.score
and score.diff
column).
Please note that SNVs located outside of G4 regions will be automatically ignored.
You can set the mode
parameter in
SNVImpactG4
function to 'm'
to enable
sample-centric mode. This mode is particularly useful for specific
scenarios, such as analyzing SNVs derived from cancer patients.
When using this mode, you must also specify the column names for
sample IDs (sampleid_col
) and SNV IDs
(snvid_col
).
# Column names of the Sample ID and SNV ID must be specified
snv_eff <- SNVImpactG4(chr21_G4,
snv_gr,
alt_col = "alt",
mode = "m",
sampleid_col = "sampleid",
snvid_col = "snv_id")
Let’s view the first three records,
It will return a GRanges
object containing detailed
information about the G4s (core columns and G4.info.*
columns). The object will also include the IDs of the SNVs
(snv.ids
column) located in the G4 regions, the sample IDs
(sample.ids
column), the mutated G4 sequences
(mut.G4.seq
and mut.G4.anno.seq
columns), and
the changes in G4Hunter scores (mut.score
and
score.diff
columns).
Please note that in sample-centric mode, when multiple SNVs from the
same sample occur within a single G4 region, we will consider the
combined impact of these SNVs. For example, the following G4 overlaps
with two SNVs (id_3608
and id_49857
) from
sample GQ3Y
, so the final G4Hunter score is based on the
combined effect of these two SNVs.
Given that some SNVs may have minimal effects on G4 formation, we need to filter out those SNVs that are worth investigating further for experimental validation or additional analysis. In other words, identifying those variants that have the potential to disrupt the G4 structures.
We provide a convenient function, filterSNVImpact
, to
help with this filtering process.
There are three threshold parameters for you to adjust:
raw_score_threshold
, mut_score_threshold
, and
score_diff_threshold
.
If raw_score_threshold
(a positive number, but no more
than 4) is specified, filterSNVImpact
will filter out
records where the absolute value of the original G4Hunter score is below
this threshold.
If mut_score_threshold
(a positive number, but no more
than 4) is specified, filterSNVImpact
will retain records
where the G4Hunter score of the mutated G4 sequences does not exceed
this threshold.
If score_diff_threshold
(a negative number, but no less
than -4) is specified, filterSNVImpact
will retain records
where the decrease in the G4Hunter score after mutation exceeds this
threshold. For example, if score_diff_threshold = -0.2
,
those records will be retained: |G4HunterScoremut_seq| − |G4HunterScoreraw_seq| ≤ −0.2
Our recommendation is that raw_score_threshold
should be
greater than 1.5, and the mut_score_threshold
should be
less than 1.2. This is an empirically based guideline, and you are
certainly free to adjust them to be more stringent or to use these
parameters for flexible customization as needed.
Please note that you must specify at least one threshold parameter, but you do not need to specify all of them.
For example, using the results returned in mode m
filtered_snv_eff <- filterSNVImpact(snv_eff,
mut_score_threshold = 1.2,
score_diff_threshold = -0.35)
We can examine the filtered records to identify SNVs that have a substantial impact on G4 formation.
The final results can be saved to a file similarly to how G4 prediction results are saved, as described in Section 5.2.
You can use the plotSNVImpact
function we provide to
visualize changes in G4Hunter scores. This function is suitable for both
variant-centric and sample-centric output data, as well as their
filtered outputs.
We can observe that mutated G4 structures generally exhibit a trend of decreased formation potential.
Let’s examine
how the G4Hunter scores have changed in
the filtered_snv_eff
object.
We observe that the scores for the filtered G4s have decreased significantly.
Finally, we can use plotImpactSeq
to visualize the G4
sequences affected by SNVs.
For example, we can plot the sequence logos for the top five G4s with the most significant changes in their G4Hunter scores.
top5_snv_eff <- filtered_snv_eff[order(filtered_snv_eff$score.diff)[1:5]]
plotImpactSeq(top5_snv_eff, ncol = 2)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Let’s explore a coherent example to predict the potential impact of SNVs on G4 formation.
Suppose you already have the sequences saved as a
DNAStringset
object (chr21_seq
) and the SNVs
saved as a GRanges
object (snv_gr
), then
execute the following code,
# First step, predict G4s
chr21_G4 <- G4HunterDetect(chr21_seq)
# Second step, evaluate the impact of SNVs on G4s in 's' mode
snv_eff <- SNVImpactG4(chr21_G4, snv_gr, alt_col = "alt")
# Filter the results under default parameters
filtered_snv_eff <- filterSNVImpact(snv_eff, mut_score_threshold = 1.2)
# export as csv format
write.csv(as.data.frame(filtered_snv_eff), "/path/to/your/file.csv")
The author would like to thank Dr. Wenyong Zhu and Dr. Xiao Sun from Southeast University for their contributions: Dr. Zhu for providing the illustrations and testing the package, and Dr. Sun for reviewing this Vignettes.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DT_0.33 BSgenome.Hsapiens.UCSC.hg19_1.4.3
## [3] BSgenome_1.75.0 rtracklayer_1.67.0
## [5] BiocIO_1.17.1 Biostrings_2.75.3
## [7] XVector_0.47.2 GenomicRanges_1.59.1
## [9] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [11] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 G4SNVHunter_0.99.4
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] viridisLite_0.4.2 farver_2.1.2
## [3] viridis_0.6.5 bitops_1.0-9
## [5] fastmap_1.2.0 RCurl_1.98-1.16
## [7] GenomicAlignments_1.43.0 XML_3.99-0.18
## [9] digest_0.6.37 lifecycle_1.0.4
## [11] magrittr_2.0.3 compiler_4.4.2
## [13] rlang_1.1.4 sass_0.4.9
## [15] progress_1.2.3 tools_4.4.2
## [17] yaml_2.3.10 data.table_1.16.4
## [19] knitr_1.49 labeling_0.4.3
## [21] prettyunits_1.2.0 S4Arrays_1.7.1
## [23] htmlwidgets_1.6.4 curl_6.1.0
## [25] DelayedArray_0.33.3 abind_1.4-8
## [27] BiocParallel_1.41.0 withr_3.0.2
## [29] sys_3.4.3 grid_4.4.2
## [31] ggpointdensity_0.1.0 colorspace_2.1-1
## [33] ggplot2_3.5.1 scales_1.3.0
## [35] SummarizedExperiment_1.37.0 cli_3.6.3
## [37] rmarkdown_2.29 crayon_1.5.3
## [39] httr_1.4.7 rjson_0.2.23
## [41] RcppRoll_0.3.1 cachem_1.1.0
## [43] ggseqlogo_0.2 zlibbioc_1.53.0
## [45] parallel_4.4.2 BiocManager_1.30.25
## [47] restfulr_0.0.15 matrixStats_1.5.0
## [49] vctrs_0.6.5 Matrix_1.7-1
## [51] jsonlite_1.8.9 hms_1.1.3
## [53] maketools_1.3.1 crosstalk_1.2.1
## [55] jquerylib_0.1.4 glue_1.8.0
## [57] codetools_0.2-20 cowplot_1.1.3
## [59] gtable_0.3.6 UCSC.utils_1.3.0
## [61] munsell_0.5.1 tibble_3.2.1
## [63] pillar_1.10.1 htmltools_0.5.8.1
## [65] GenomeInfoDbData_1.2.13 R6_2.5.1
## [67] evaluate_1.0.3 Biobase_2.67.0
## [69] lattice_0.22-6 Rsamtools_2.23.1
## [71] bslib_0.8.0 Rcpp_1.0.14
## [73] gridExtra_2.3 SparseArray_1.7.2
## [75] xfun_0.50 MatrixGenerics_1.19.1
## [77] buildtools_1.0.0 pkgconfig_2.0.3