Introduction to G4SNVHunter

Introduction

G-quadruplexes (G4s) are non-canonical secondary structures that can form in guanine-rich DNA or RNA regions (Varshney et al. 2020). These structures are significantly enriched in functional regions like gene promoters (Huppert and Balasubramanian 2007) and telomeres (Phan and Mergny 2002), and their regulatory roles are closely linked to their structures rather than their sequences (Zhang et al. 2024).

Single nucleotide variants (SNVs) are a common type of genetic variation. If such a variant occurs within a G4 region, it may affect the formation of that G4 structure.

G4SNVHunter is designed to evaluate the potential impact of SNVs on G4 formation, but it can also be used to assess single nucleotide polymorphisms (SNPs). This package leverages the highly effective G4 structure prediction algorithm, G4Hunter (Bedrat, Lacroix, and Mergny 2016), which was conceptualized by Mergny et al.  (Bedrat, Lacroix, and Mergny 2016). One of G4Hunter’s notable advantages is that it will evaluate the propensity of G4 structures within a specified window, effectively taking into account the influence of flanking sequences on G4 formation, not just the G4 sequence itself.

G4SNVHunter only requires users to provide genomic sequence data (DNAStringSet object) and substitution information of variants (GRanges object) to quickly determine which G4 structures may be affected by SNVs (or SNPs). We simplify the analysis process by integrating complex functionality into a simple function that can be used even by those inexperienced in G4 research or basic programming skills. Users can subsequently design ‘wet’ experiments based on the results of G4SNVHunter to further explore the mechanisms by which variants affect biological regulatory functions through their impact on G4 structures.

Installation

As the first step, you need to install G4SNVHunter, which can be done with a simple command, please ensure that your network connection is stable.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Currently, G4SNVHunter depends on the devel version of Bioconductor.
# Note: Bioconductor devel requires the latest R version (4.5 or higher).
# Or, you can install G4SNVHunter from GitHub instead.
BiocManager::install(version='devel')

# Install the G4SNVHunter package and its dependencies for this tutorial.
BiocManager::install("G4SNVHunter", dependencies = TRUE)

Then load G4SNVHunter to access its functions.

library(G4SNVHunter)

Load packages required for this tutorial

During this tutorial, we might need to use a few additional packages. Since we specified dependencies = TRUE when installing G4SNVHunter package, these additional packages have already been installed.

We can load them directly.

library(BSgenome.Hsapiens.UCSC.hg19)

library(GenomicRanges)

library(DT)

library(rtracklayer)

Input data

G4SNVHunter requires you to provide two types of data:

  • the genomic sequences (a DNAStringSet object)

  • the set of SNVs or SNPs (a GRanges object)

While the input formats for these data are quite flexible, they must ultimately be converted into the appropriate formats: DNAStringSet for sequences and GRanges for SNVs.

Detailed descriptions on how to prepare and format these inputs are provided below.

Genomic sequences

The genomic sequences refer to the sequence of the chromosome or fragment where your SNVs (or SNPs) are located. They can be complete chromosome sequences or large fragments extracted from chromosomes. We will predict G4 structures from the sequences you provide and analyze whether these SNVs can affect their formation.

Please note that the coordinates of your SNVs must be relative to the genomic sequence you provide, and they should be in 1-based coordinates.

The genomic sequences need to be processed into a DNAStringSet object. We provide a built-in function, loadSequence, to facilitate this processing. Of course, you can also load your customized sequence file and convert it into a DNAStringSet object without using loadSequence function.

loadSequence mainly accepts three types of input:

  • A two-column data.frame, with the first column containing the sequence identifiers and the second column containing their corresponding sequences. This should be specified using the genome_seq parameter.

  • The path to a stored FASTA file. The FASTA file must have a .fa, .fna, or .fasta extension. This should be specified using the seq_path parameter.

  • A text file (.txt) that stores the sequence identifiers and their corresponding sequences. The first column should contain the sequence identifiers, and the second column should contain the sequences. This should also be specified using the seq_path parameter.
    Note: Please do not include column names in the file!

Here are some examples for you to load sequences into a DNAStringSet object using loadSequence function:

Load from a data.frame object

seq_df <- data.frame(chr = c("seq1", "seq2"),
                     sequence = c(paste0(rep("G", 100), collapse = ""), 
                                  paste0(rep("A", 100), collapse = "")))
seq <- loadSequence(genome_seq = seq_df)

Load from a fasta file

# File path to the sequences in fasta format
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
seq <- loadSequence(seq_path = fa_path)

Load from a txt file

# File path to the sequences in txt format
txt_path <- system.file("extdata", "seq.txt", package = "G4SNVHunter")
seq <- loadSequence(seq_path = txt_path)

We can also obtain genome sequences for some common species from Bioconductor Annotation Packages. While this is convenient, it requires you to install some related packages in advance. For example:

# Load sequence for chromosome 21 (hg19)
hg19 <- BSgenome.Hsapiens.UCSC.hg19
chr21_seq <- DNAStringSet(hg19$chr21)
# Chromosome names are needed for analysis
names(chr21_seq) <- "chr21"

SNV data

The SNV (or SNP) data needs to be processed into a GRanges object. Since the form of SNV data is too flexible, we do not provide a function to load SNVs here. However, you can easily load and convert SNV data into a GRanges object yourself. For example,

# Path to SNPs
snp_path <- system.file("extdata", "snp.txt", package = "G4SNVHunter")
# Load SNPs into memory
snp <- read.table(snp_path, sep = "\t", header = FALSE)
# Convert snp to GRanges
snp <- GRanges(seqnames = snp$V1,
               ranges = IRanges(start = snp$V2, width = 1),
               rsid = snp$V3,
               ref = snp$V4,
               alt = snp$V5)
print(snp)
## GRanges object with 15 ranges and 3 metadata columns:
##        seqnames    ranges strand |         rsid         ref         alt
##           <Rle> <IRanges>  <Rle> |  <character> <character> <character>
##    [1]    chr21  15397335      * | rs1244782663           G           A
##    [2]    chr21  39931215      * |  rs910271214           G           A
##    [3]    chr21  22915361      * | rs1210166884           A           G
##    [4]    chr21  42644592      * |  rs554035836           A           G
##    [5]    chr21  10743898      * |  rs955290928           A           G
##    ...      ...       ...    ... .          ...         ...         ...
##   [11]    chr21  30836340      * | rs1212424980           T           C
##   [12]    chr21  36426864      * | rs1478442431           G           T
##   [13]    chr21  27584238      * | rs1191559536           T           C
##   [14]    chr21  36106373      * |  rs561521189           C           T
##   [15]    chr21  21840835      * | rs1301205088           T           C
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

SNV data validation

We provide a built-in checkSNV function for you to verify whether your SNV (or SNP) data meets the requirements of our program.

However, it is not necessary to run this function explicitly, as our program will automatically perform the checks at the appropriate stages. Nevertheless, performing a proactive check is always a good practice.

gr <- GRanges(
  seqnames = Rle("seq1"),
  ranges = IRanges(c(100, 200, 300), width=1),
    ref = c("A", "C", "G"),
    alt = c("T", "T", "A")
  )
# Check width ('w'), ref ('r'), and alt ('a')
# Returns TRUE if it passed the validation
checkSNV(gr, mode = "wra", ref_col = "ref", alt_col = "alt")
## [1] TRUE

Predict G4s

Leverage G4Hunter for G4 detection

You can use the G4HunterDetect function to directly predict G4s from a DNAStringSet object containing sequences based on G4Hunter algorithm.
For example,

# Sequence file in fasta file format 
fa_path <- system.file("extdata", "seq.fa", package = "G4SNVHunter")
# Load sequences
seq <- loadSequence(seq_path = fa_path)

# Predict G4s
G4_detected <- G4HunterDetect(seq)
## Start processing...
## Merging predicted G4s...
## Done!

Then, we can examine the prediction results, which are stored as a GRanges object.

You can use functions like print to directly view the results. However, in this instance, we’ll leverage the datatable function to view the predicted G4s, as this method provides a more user-friendly display.

Let’s take a look at the G4s,

datatable(as.data.frame(G4_detected), options = list(scrollX = TRUE))

This function will return a GRanges object containing all potential G4s predicted by the G4Hunter algorithms. It mainly includes:

  • The sequence identifier where the G4 is located (seqnames),

  • The position of the G4 relative to the sequence you provided (ranges, 1-based coordinate),

  • The strand on which the G4 is located (strand), with + indicating the G4 is on the positive strand, and - indicating it is on the negative strand,

  • The G4Hunter score for the G4 (score),

  • The maximum G4Hunter score of the windows covering that G4 during the execution of the G4Hunter algorithm (max_score, used to determine if the G4 surpasses the threshold),

  • The predicted G4 sequence on the positive strand (sequence, with C-rich means G4s on the reverse strand).

Please note that score reflects the precise score of the G4 sequence, while max_score represents the highest score obtained by sliding a window across that G4.

See the illustration below.

Some parameters for prediction can be customized. For example,

# Predict G4 by customizing other parameters
G4_detected <- G4HunterDetect(seq, threshold = 1.5, window_size = 20)
## Start processing...
## Merging predicted G4s...
## Done!
  • threshold: G4Hunter will search for G4s in windows above this threshold (absolute value). Default is 1.5. Unless there are special needs, we do not recommend setting the threshold below 1.2.

  • window_size: The window size (bp) for G4Hunter prediction. Default is 25. Another commonly used window size is 20. However, 25 is generally preferred unless otherwise required.

  • include_sequences: Whether to include the predicted G4 sequences in the output. Default is TRUE.

  • strands: Indicate which strand requires prediction, with b for both strands and p for positive strand only. Please note that if your genome is single-stranded, this parameter should be set to p to prevent the program from predicting G4s on a non-existent negative strand.

We generally do not recommend modifying certain parameters, such as window_size and threshold, as their default settings are already optimal.

Export prediction results

Since we store the prediction results as a GRanges object, exporting is very straightforward.

For example, you can export it into a CSV file,

# export as csv format
write.csv(as.data.frame(G4_detected), "/path/to/your/file.csv")

or export it into other formats as well.

# export as bed format
export(G4_detected, "/path/to/your/file.bed", format = "bed")
# export as bigwig format
export(G4_detected, "/path/to/your/file.bw", format = "bigWig")

Visualize G4Hunter prediction results

You can use plotG4Info function for visualizing the statistical information of G4s predicted by the G4HunterDetect function.

plotG4Info(G4_detected)

The left panels (A, C, E) show the overall distribution of the max_score (absolute value), score (absolute value), and length of G4s, respectively. The right panels (B, D, F) illustrate the distribution of max_score, score, and length for G4 structures on both the positive and negative strands. The scores for G4s on the positive strand are positive, while those on the negative strand are negative, resulting in a bimodal distribution of max_score and score. Please note that the positive or negative score reflects only the type of strand the G4 is on, whereas the capability of the G4 structure to form is determined by the absolute value of the score.

Evaluate SNV effects

We provide two modes in SNVImpactG4 function for you to assess the potential impact of SNVs on G4s:

  • Variant-centric assessment (s mode)

  • Sample-centric assessment (m mode)

In short, variant-centric assessment considers the impact of each SNV on the G4 sequence individually, while sample-centric assessment combines the effects of multiple SNVs on a particular G4 for each sample.

See the illustration below.

We have prepared example data with variants from chromosome 21 of the human genome (hg19), and you can easily load them,

data(snp_gr)

data(snv_gr)

Additionally, you can quickly and conveniently predict G4 sequences on chromosome 21 using the G4HunterDetect function.

# Predict the G4s in human chr 21 (hg19)
chr21_G4 <- G4HunterDetect(chr21_seq)
## Start processing...
## Now process: chr21.
## Merging predicted G4s...
## Done!

Variant-centric assessment

The default mode of SNVImpactG4 is variant-centric (mode = 's'), which requires only the G4 data in GRanges format (as returned by the G4HunterDetect function), the SNV data in GRanges format, and the column name for the alternate bases.

Then, the assessment can be done easily by,

snp_eff <- SNVImpactG4(chr21_G4, 
                       snp_gr, 
                       alt_col = "alt")

Let’s view the first three records,

datatable(as.data.frame(snp_eff[1:3]), options = list(scrollX = TRUE))

SNVImpactG4 function will return a GRanges object that includes detailed information about the SNVs (core columns and SNV.info.* columns), their associated G4 regions (G4.info.* columns), and the changes in G4 sequences (mut.G4.seq and mut.G4.anno.seq columns) and G4Hunter scores (mut.score and score.diff column).

Please note that SNVs located outside of G4 regions will be automatically ignored.

Sample-centric assessment

You can set the mode parameter in SNVImpactG4 function to 'm' to enable sample-centric mode. This mode is particularly useful for specific scenarios, such as analyzing SNVs derived from cancer patients.

When using this mode, you must also specify the column names for sample IDs (sampleid_col) and SNV IDs (snvid_col).

# Column names of the Sample ID and SNV ID must be specified
snv_eff <- SNVImpactG4(chr21_G4, 
                       snv_gr, 
                       alt_col = "alt", 
                       mode = "m", 
                       sampleid_col = "sampleid", 
                       snvid_col = "snv_id")

Let’s view the first three records,

datatable(as.data.frame(snv_eff[1:3]), options = list(scrollX = TRUE))

It will return a GRanges object containing detailed information about the G4s (core columns and G4.info.* columns). The object will also include the IDs of the SNVs (snv.ids column) located in the G4 regions, the sample IDs (sample.ids column), the mutated G4 sequences (mut.G4.seq and mut.G4.anno.seq columns), and the changes in G4Hunter scores (mut.score and score.diff columns).

Please note that in sample-centric mode, when multiple SNVs from the same sample occur within a single G4 region, we will consider the combined impact of these SNVs. For example, the following G4 overlaps with two SNVs (id_3608 and id_49857) from sample GQ3Y, so the final G4Hunter score is based on the combined effect of these two SNVs.

datatable(as.data.frame(snv_eff[528]), options = list(scrollX = TRUE))

Filtering potentially affected G4 sequences

Given that some SNVs may have minimal effects on G4 formation, we need to filter out those SNVs that are worth investigating further for experimental validation or additional analysis. In other words, identifying those variants that have the potential to disrupt the G4 structures.

We provide a convenient function, filterSNVImpact, to help with this filtering process.

There are three threshold parameters for you to adjust: raw_score_threshold, mut_score_threshold, and score_diff_threshold.

If raw_score_threshold (a positive number, but no more than 4) is specified, filterSNVImpact will filter out records where the absolute value of the original G4Hunter score is below this threshold.

If mut_score_threshold (a positive number, but no more than 4) is specified, filterSNVImpact will retain records where the G4Hunter score of the mutated G4 sequences does not exceed this threshold.

If score_diff_threshold (a negative number, but no less than -4) is specified, filterSNVImpact will retain records where the decrease in the G4Hunter score after mutation exceeds this threshold. For example, if score_diff_threshold = -0.2, those records will be retained: |G4HunterScoremut_seq| − |G4HunterScoreraw_seq| ≤ −0.2

Our recommendation is that raw_score_threshold should be greater than 1.5, and the mut_score_threshold should be less than 1.2. This is an empirically based guideline, and you are certainly free to adjust them to be more stringent or to use these parameters for flexible customization as needed.

Please note that you must specify at least one threshold parameter, but you do not need to specify all of them.

For example, using the results returned in mode m

filtered_snv_eff <- filterSNVImpact(snv_eff,
                                    mut_score_threshold = 1.2,
                                    score_diff_threshold = -0.35)

We can examine the filtered records to identify SNVs that have a substantial impact on G4 formation.

datatable(as.data.frame(filtered_snv_eff), options = list(scrollX = TRUE))

The final results can be saved to a file similarly to how G4 prediction results are saved, as described in Section 5.2.

Visualize changes in G4Hunter scores

You can use the plotSNVImpact function we provide to visualize changes in G4Hunter scores. This function is suitable for both variant-centric and sample-centric output data, as well as their filtered outputs.

We can observe that mutated G4 structures generally exhibit a trend of decreased formation potential.

plotSNVImpact(snv_eff)

Let’s examine how the G4Hunter scores have changed in the filtered_snv_eff object.

We observe that the scores for the filtered G4s have decreased significantly.

plotSNVImpact(filtered_snv_eff)

Plot the seqlogo for mutated G4 sequences

Finally, we can use plotImpactSeq to visualize the G4 sequences affected by SNVs.

For example, we can plot the sequence logos for the top five G4s with the most significant changes in their G4Hunter scores.

top5_snv_eff <- filtered_snv_eff[order(filtered_snv_eff$score.diff)[1:5]]
plotImpactSeq(top5_snv_eff, ncol = 2)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

A coherent example

Let’s explore a coherent example to predict the potential impact of SNVs on G4 formation.

Suppose you already have the sequences saved as a DNAStringset object (chr21_seq) and the SNVs saved as a GRanges object (snv_gr), then execute the following code,

# First step, predict G4s
chr21_G4 <- G4HunterDetect(chr21_seq)

# Second step, evaluate the impact of SNVs on G4s in 's' mode
snv_eff <- SNVImpactG4(chr21_G4, snv_gr, alt_col = "alt")

# Filter the results under default parameters
filtered_snv_eff <- filterSNVImpact(snv_eff, mut_score_threshold = 1.2)

# export as csv format
write.csv(as.data.frame(filtered_snv_eff), "/path/to/your/file.csv")

Acknowledgements

The author would like to thank Dr. Wenyong Zhu and Dr. Xiao Sun from Southeast University for their contributions: Dr. Zhu for providing the illustrations and testing the package, and Dr. Sun for reviewing this Vignettes.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DT_0.33                           BSgenome.Hsapiens.UCSC.hg19_1.4.3
##  [3] BSgenome_1.75.0                   rtracklayer_1.67.0               
##  [5] BiocIO_1.17.1                     Biostrings_2.75.3                
##  [7] XVector_0.47.2                    GenomicRanges_1.59.1             
##  [9] GenomeInfoDb_1.43.2               IRanges_2.41.2                   
## [11] S4Vectors_0.45.2                  BiocGenerics_0.53.3              
## [13] generics_0.1.3                    G4SNVHunter_0.99.4               
## [15] BiocStyle_2.35.0                 
## 
## loaded via a namespace (and not attached):
##  [1] viridisLite_0.4.2           farver_2.1.2               
##  [3] viridis_0.6.5               bitops_1.0-9               
##  [5] fastmap_1.2.0               RCurl_1.98-1.16            
##  [7] GenomicAlignments_1.43.0    XML_3.99-0.18              
##  [9] digest_0.6.37               lifecycle_1.0.4            
## [11] magrittr_2.0.3              compiler_4.4.2             
## [13] rlang_1.1.4                 sass_0.4.9                 
## [15] progress_1.2.3              tools_4.4.2                
## [17] yaml_2.3.10                 data.table_1.16.4          
## [19] knitr_1.49                  labeling_0.4.3             
## [21] prettyunits_1.2.0           S4Arrays_1.7.1             
## [23] htmlwidgets_1.6.4           curl_6.1.0                 
## [25] DelayedArray_0.33.3         abind_1.4-8                
## [27] BiocParallel_1.41.0         withr_3.0.2                
## [29] sys_3.4.3                   grid_4.4.2                 
## [31] ggpointdensity_0.1.0        colorspace_2.1-1           
## [33] ggplot2_3.5.1               scales_1.3.0               
## [35] SummarizedExperiment_1.37.0 cli_3.6.3                  
## [37] rmarkdown_2.29              crayon_1.5.3               
## [39] httr_1.4.7                  rjson_0.2.23               
## [41] RcppRoll_0.3.1              cachem_1.1.0               
## [43] ggseqlogo_0.2               zlibbioc_1.53.0            
## [45] parallel_4.4.2              BiocManager_1.30.25        
## [47] restfulr_0.0.15             matrixStats_1.5.0          
## [49] vctrs_0.6.5                 Matrix_1.7-1               
## [51] jsonlite_1.8.9              hms_1.1.3                  
## [53] maketools_1.3.1             crosstalk_1.2.1            
## [55] jquerylib_0.1.4             glue_1.8.0                 
## [57] codetools_0.2-20            cowplot_1.1.3              
## [59] gtable_0.3.6                UCSC.utils_1.3.0           
## [61] munsell_0.5.1               tibble_3.2.1               
## [63] pillar_1.10.1               htmltools_0.5.8.1          
## [65] GenomeInfoDbData_1.2.13     R6_2.5.1                   
## [67] evaluate_1.0.3              Biobase_2.67.0             
## [69] lattice_0.22-6              Rsamtools_2.23.1           
## [71] bslib_0.8.0                 Rcpp_1.0.14                
## [73] gridExtra_2.3               SparseArray_1.7.2          
## [75] xfun_0.50                   MatrixGenerics_1.19.1      
## [77] buildtools_1.0.0            pkgconfig_2.0.3

References

Bedrat, Amina, Laurent Lacroix, and Jean-Louis Mergny. 2016. “Re-Evaluation of g-Quadruplex Propensity with G4Hunter.” Nucleic Acids Research 44 (4): 1746–59.
Huppert, Julian L, and Shankar Balasubramanian. 2007. “G-Quadruplexes in Promoters Throughout the Human Genome.” Nucleic Acids Research 35 (2): 406–13.
Phan, Anh Tuân, and Jean-Louis Mergny. 2002. “Human Telomeric DNA: G-Quadruplex, i-Motif and Watson–Crick Double Helix.” Nucleic Acids Research 30 (21): 4618–25.
Varshney, Dhaval, Jochen Spiegel, Katherine Zyner, David Tannahill, and Shankar Balasubramanian. 2020. “The Regulation and Functions of DNA and RNA g-Quadruplexes.” Nature Reviews Molecular Cell Biology 21 (8): 459–74.
Zhang, Rongxin, Yuqi Wang, Cheng Wang, Xiao Sun, and Jean-Louis Mergny. 2024. “G-Quadruplexes as Pivotal Components of Cis-Regulatory Elements in the Human Genome.” bioRxiv, 2024–01.