RNAmodR: AlkAnilineSeq

Introduction

7-methyl guanosine (m7G), 3-methyl cytidine (m3C) and dihydrouridine (D) are commonly found in rRNA and tRNA and can be detected classically by primer extension analysis. However, since the modifications do not interfere with Watson-Crick base pairing, a specific chemical treatment needs to be employed to cause strand breaks specifically at the modified positions. Initially, this involved a sodium borohydride treatment to create abasic sites, followed by cleavage of the RNA at these abasic sites with aniline.

This classical protocol was converted into a high-throughput sequencing method called AlkAnilineSeq, which allows modified positions to be detected by an accumulation of 5’-ends at the N+1 position (Marchand et al. 2018). It was found that m3C is also susceptible to this treatment, which allows m7G, m3C and D to be detected by the same method from the same data sets, since the identity of the unmodified nucleotide distinguishes the three modified nucleotides.
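
The two ideas in this paragraph (a 5’-end at N+1 points to a modification at N, and the unmodified nucleotide at N decides which modification it is) can be sketched in base R. This is only an illustration: the helper `call_candidates`, the lookup vector `candidate_mod` and the toy sequence are hypothetical and not part of the package.

```r
# Candidate modification implied by the unmodified nucleotide at position N.
# Sequences stored as DNA use T instead of U, so both are mapped to D.
candidate_mod <- c(G = "m7G", C = "m3C", U = "D", T = "D")

# Hypothetical helper: takes a transcript sequence and the positions at which
# 5'-ends accumulate (N+1) and returns the implied modified positions (N).
call_candidates <- function(sequence, end5_positions) {
  modified_pos <- end5_positions - 1L
  modified_pos <- modified_pos[modified_pos >= 1L]
  nt <- substring(sequence, modified_pos, modified_pos)
  data.frame(position = modified_pos,
             nucleotide = nt,
             mod = unname(candidate_mod[nt]),
             stringsAsFactors = FALSE)
}

# Toy sequence and toy 5'-end positions, invented for illustration.
toy_sequence <- "ACGUAGCUUA"
calls <- call_candidates(toy_sequence, end5_positions = c(4L, 8L, 10L))
print(calls)
```

Positions whose unmodified nucleotide is an A would receive `NA` in the `mod` column, since none of the three modifications derives from adenosine.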

The ModAlkAnilineSeq class uses the NormEnd5SequenceData class to store and aggregate data along the transcripts. The calculated scores follow the nomenclature of Marchand et al. (2018), with the names scoreNC (default) and scoreSR.
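
To give an intuition for what a stop-ratio-like score measures, the following base R sketch computes, for each position, the fraction of reads overlapping the position whose 5’-end lies exactly there. This definition is an assumption made for illustration; the exact computation in the package (normalization, handling of replicates, shift from N+1 to N) may differ. The counts are invented toy data and `stop_ratio` is a hypothetical helper.

```r
# Toy data (invented): read coverage and number of read 5'-ends per position
# along a short stretch of a transcript.
coverage <- c(200L, 210L, 205L, 220L, 215L, 30L, 0L)
end5     <- c(  2L,   1L,   3L, 190L,   4L,  0L, 0L)

# Assumed definition: share of the reads overlapping a position that start
# there. Positions without coverage get NA instead of a division by zero.
stop_ratio <- function(end5, coverage) {
  ifelse(coverage > 0L, end5 / coverage, NA_real_)
}

sr <- stop_ratio(end5, coverage)
print(round(sr, 3))

# The strongest accumulation of 5'-ends marks N+1, so the candidate modified
# position is one nucleotide upstream.
n_plus_1 <- which.max(sr)
candidate_n <- n_plus_1 - 1L
print(candidate_n)
```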

library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.AlkAnilineSeq)
library(RNAmodR.Data)

Example workflow

The example workflow is limited to 18S rRNA and some tRNAs from S. cerevisiae. Annotation data can be provided as either a GFF file or a TxDb object, and sequence data as either a FASTA file or a BSgenome object. The read data is provided as BAM files.

annotation <- GFF3File(RNAmodR.Data.example.AAS.gff3())
sequences <- RNAmodR.Data.example.AAS.fasta()
files <- list("wt" = c(treated = RNAmodR.Data.example.wt.1(),
                       treated = RNAmodR.Data.example.wt.2(),
                       treated = RNAmodR.Data.example.wt.3()),
              "Bud23del" = c(treated = RNAmodR.Data.example.bud23.1(),
                             treated = RNAmodR.Data.example.bud23.2()),
              "Trm8del" = c(treated = RNAmodR.Data.example.trm8.1(),
                            treated = RNAmodR.Data.example.trm8.2()))
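
For your own data, a list with the same shape (one element per sample, each replicate named `treated`) can be assembled from file paths with base R. This is a sketch under assumptions: the paths below are hypothetical placeholders, and it assumes that plain character paths are accepted in place of the ExperimentHub accessors used above.

```r
# Hypothetical BAM paths, named by the sample they belong to.
bam_paths <- c(wt = "data/wt_rep1.bam",
               wt = "data/wt_rep2.bam",
               ko = "data/ko_rep1.bam")

# Keep the samples in their order of first appearance.
samples <- factor(names(bam_paths), levels = unique(names(bam_paths)))

# Split the paths by sample and label every replicate as "treated".
files_own <- lapply(split(unname(bam_paths), samples),
                    function(paths) {
                      setNames(paths, rep("treated", length(paths)))
                    })
str(files_own)
```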

The analysis is triggered by the construction of a ModSetAlkAnilineSeq object. Internally, parallelization is used via the BiocParallel package, which allows performance to be tuned depending on the number and size of the input files (number of samples, number of replicates, number of transcripts, etc.).

msaas <- ModSetAlkAnilineSeq(files, annotation = annotation, sequences = sequences)
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
## OK
msaas
## ModSetAlkAnilineSeq of length 3
## names(3): wt Bud23del Trm8del
## | Modification type(s):  m7G / m3C / D                                               
##                             wt Bud23del Trm8del
## | Modifications found: yes (9)  yes (8) yes (7)
## | Settings:
##          minCoverage minReplicate  find.mod minLength minSignal minScoreNC
##            <integer>    <integer> <logical> <integer> <integer>  <integer>
## wt                10            1      TRUE         9        10         50
## Bud23del          10            1      TRUE         9        10         50
## Trm8del           10            1      TRUE         9        10         50
##          minScoreSR minScoreBaseScore scoreOperator
##           <numeric>         <numeric>   <character>
## wt              0.5               0.9             &
## Bud23del        0.5               0.9             &
## Trm8del         0.5               0.9             &
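
The settings table suggests how the two scores are combined: each score has its own minimum, and `scoreOperator` decides whether both (`&`) or either (`|`) must be met. The base R sketch below illustrates this reading. It is an assumption about the logic, not the package's implementation; `minScoreBaseScore`, `minSignal` and the other settings are deliberately not modelled, and `passes_thresholds` is a hypothetical helper. It also assumes that the `score` column of the results corresponds to scoreNC, the default score.

```r
# Thresholds copied from the settings printed above.
settings_wt <- list(minScoreNC = 50L, minScoreSR = 0.5, scoreOperator = "&")

# Hypothetical helper: combine the two threshold tests with the operator
# named in the settings ("&" requires both, "|" requires at least one).
passes_thresholds <- function(scoreNC, scoreSR, s) {
  op <- match.fun(s$scoreOperator)
  op(scoreNC >= s$minScoreNC, scoreSR >= s$minScoreSR)
}

# First and third pair are scores taken from the example output in this
# document; the second pair (40, 0.9) is invented to show a failing case.
scoreNC <- c(162.228, 40, 86.3556)
scoreSR <- c(0.984209, 0.9, 0.605249)

strict <- passes_thresholds(scoreNC, scoreSR, settings_wt)
print(strict)

# With "|" the invented pair passes on its stop ratio alone.
lenient <- passes_thresholds(scoreNC, scoreSR,
                             modifyList(settings_wt, list(scoreOperator = "|")))
print(lenient)
```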

As expected, m7G1575 is missing from the Bud23del samples.

mod <- modifications(msaas)
lapply(mod, head, n = 2L)
## $wt
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr1      1575      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr3        46      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]   162.228  0.984209           1
##   [2]   373.773  0.841166           3
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
## 
## $Bud23del
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr3        46      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr5        50      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]  254.6403  0.858101           3
##   [2]   86.3556  0.605249           5
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
## 
## $Trm8del
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames    ranges strand |         mod                source        type
##          <Rle> <IRanges>  <Rle> | <character>           <character> <character>
##   [1]     chr1      1575      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##   [2]     chr3        37      + |         m7G RNAmodR.AlkAnilineSeq      RNAMOD
##           score   scoreSR      Parent
##       <numeric> <numeric> <character>
##   [1]  117.2479   0.98729           1
##   [2]   69.9604   0.97953           3
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
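
The absence of m7G1575 in Bud23del can also be checked programmatically. The base R sketch below works on plain data frames that copy only the rows printed above, so it is an illustration rather than a full analysis; `head_calls` and `has_site` are hypothetical names.

```r
# Sequence names and positions copied from the first two rows printed for
# each sample above.
head_calls <- list(
  wt       = data.frame(seqnames = c("chr1", "chr3"), pos = c(1575L, 46L)),
  Bud23del = data.frame(seqnames = c("chr3", "chr5"), pos = c(46L, 50L)),
  Trm8del  = data.frame(seqnames = c("chr1", "chr3"), pos = c(1575L, 37L))
)

# Is position 1575 on chr1 (the 18S rRNA) among the listed rows?
has_site <- vapply(head_calls,
                   function(df) any(df$seqnames == "chr1" & df$pos == 1575L),
                   logical(1))
print(has_site)
```

On the real GRanges objects, the same test could be written with the `seqnames()` and `start()` accessors from GenomicRanges.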

Visualizing the results

As outlined in the RNAmodR package, we can compare the samples using plotCompareByCoord to prepare a heatmap. For this, we select some positions from the modifications found. In addition, we prepare an alias table.

coord <- mod[[1L]]
alias <- data.frame(tx_id = c(1L,3L,5L,6L,7L,8L,10L,11L),
                    name = c("18S rRNA","tF(GAA)B","tG(GCC)B","tT(AGT)B",
                             "tQ(TTG)B","tC(GCA)B","tS(CGA)C","tV(AAC)E1"),
                    stringsAsFactors = FALSE)
plotCompareByCoord(msaas, coord, score = "scoreSR", alias = alias,
                   normalize = TRUE)

Heatmap showing Stop ratio scores for detected m7G, m3C and D positions.

plotCompareByCoord(msaas, coord[1L], score = "scoreSR", alias = alias)

Heatmap showing Stop ratio scores for detected m7G1575 on the 18S rRNA.

In addition, the aggregate data along the transcript can be visualized as well.

plotData(msaas, "1", from = 1550L, to = 1600L)

Stop ratio scores for detected m7G1575 on the 18S rRNA plotted as bar plots along the sequence.

The raw sequence data can be included in the plot as well.

plotData(msaas[1L:2L], "1", from = 1550L, to = 1600L, showSequenceData = TRUE)

Stop ratio scores for detected m7G1575 on the 18S rRNA plotted as bar plots along the sequence. The raw sequence data is shown by setting `showSequenceData = TRUE`.

Session info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] Rsamtools_2.23.1             RNAmodR.Data_1.20.0         
##  [3] ExperimentHubData_1.33.0     AnnotationHubData_1.37.0    
##  [5] futile.logger_1.4.3          ExperimentHub_2.15.0        
##  [7] AnnotationHub_3.15.0         BiocFileCache_2.15.0        
##  [9] dbplyr_2.5.0                 RNAmodR.AlkAnilineSeq_1.21.0
## [11] RNAmodR_1.21.0               Modstrings_1.23.0           
## [13] Biostrings_2.75.1            XVector_0.47.0              
## [15] rtracklayer_1.67.0           GenomicRanges_1.59.1        
## [17] GenomeInfoDb_1.43.2          IRanges_2.41.1              
## [19] S4Vectors_0.45.2             BiocGenerics_0.53.3         
## [21] generics_0.1.3               BiocStyle_2.35.0            
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.17.1               bitops_1.0-9               
##   [3] filelock_1.0.3              tibble_3.2.1               
##   [5] graph_1.85.0                XML_3.99-0.17              
##   [7] rpart_4.1.23                lifecycle_1.0.4            
##   [9] httr2_1.0.7                 lattice_0.22-6             
##  [11] ensembldb_2.31.0            OrganismDbi_1.49.0         
##  [13] backports_1.5.0             magrittr_2.0.3             
##  [15] Hmisc_5.2-0                 sass_0.4.9                 
##  [17] rmarkdown_2.29              jquerylib_0.1.4            
##  [19] yaml_2.3.10                 RUnit_0.4.33               
##  [21] Gviz_1.51.0                 DBI_1.2.3                  
##  [23] buildtools_1.0.0            RColorBrewer_1.1-3         
##  [25] abind_1.4-8                 zlibbioc_1.52.0            
##  [27] purrr_1.0.2                 AnnotationFilter_1.31.0    
##  [29] biovizBase_1.55.0           RCurl_1.98-1.16            
##  [31] nnet_7.3-19                 VariantAnnotation_1.53.0   
##  [33] rappdirs_0.3.3              GenomeInfoDbData_1.2.13    
##  [35] AnnotationForge_1.49.0      maketools_1.3.1            
##  [37] codetools_0.2-20            DelayedArray_0.33.2        
##  [39] xml2_1.3.6                  tidyselect_1.2.1           
##  [41] farver_2.1.2                UCSC.utils_1.3.0           
##  [43] matrixStats_1.4.1           base64enc_0.1-3            
##  [45] GenomicAlignments_1.43.0    jsonlite_1.8.9             
##  [47] Formula_1.2-5               tools_4.4.2                
##  [49] progress_1.2.3              stringdist_0.9.12          
##  [51] Rcpp_1.0.13-1               glue_1.8.0                 
##  [53] gridExtra_2.3               SparseArray_1.7.2          
##  [55] BiocBaseUtils_1.9.0         xfun_0.49                  
##  [57] MatrixGenerics_1.19.0       dplyr_1.1.4                
##  [59] withr_3.0.2                 formatR_1.14               
##  [61] BiocManager_1.30.25         fastmap_1.2.0              
##  [63] latticeExtra_0.6-30         fansi_1.0.6                
##  [65] digest_0.6.37               mime_0.12                  
##  [67] R6_2.5.1                    colorspace_2.1-1           
##  [69] jpeg_0.1-10                 dichromat_2.0-0.1          
##  [71] biomaRt_2.63.0              RSQLite_2.3.8              
##  [73] utf8_1.2.4                  data.table_1.16.2          
##  [75] prettyunits_1.2.0           httr_1.4.7                 
##  [77] htmlwidgets_1.6.4           S4Arrays_1.7.1             
##  [79] pkgconfig_2.0.3             gtable_0.3.6               
##  [81] blob_1.2.4                  sys_3.4.3                  
##  [83] htmltools_0.5.8.1           RBGL_1.83.0                
##  [85] ProtGenerics_1.39.0         scales_1.3.0               
##  [87] Biobase_2.67.0              png_0.1-8                  
##  [89] colorRamps_2.3.4            knitr_1.49                 
##  [91] lambda.r_1.2.4              rstudioapi_0.17.1          
##  [93] reshape2_1.4.4              rjson_0.2.23               
##  [95] checkmate_2.3.2             curl_6.0.1                 
##  [97] biocViews_1.75.0            cachem_1.1.0               
##  [99] stringr_1.5.1               BiocVersion_3.21.1         
## [101] parallel_4.4.2              foreign_0.8-87             
## [103] AnnotationDbi_1.69.0        restfulr_0.0.15            
## [105] pillar_1.9.0                grid_4.4.2                 
## [107] vctrs_0.6.5                 cluster_2.1.6              
## [109] htmlTable_2.4.3             evaluate_1.0.1             
## [111] GenomicFeatures_1.59.1      cli_3.6.3                  
## [113] compiler_4.4.2              futile.options_1.0.1       
## [115] rlang_1.1.4                 crayon_1.5.3               
## [117] labeling_0.4.3              interp_1.1-6               
## [119] plyr_1.8.9                  stringi_1.8.4              
## [121] deldir_2.0-4                BiocParallel_1.41.0        
## [123] BiocCheck_1.43.2            txdbmaker_1.3.1            
## [125] munsell_0.5.1               lazyeval_0.2.2             
## [127] Matrix_1.7-1                BSgenome_1.75.0            
## [129] hms_1.1.3                   bit64_4.5.2                
## [131] ggplot2_3.5.1               KEGGREST_1.47.0            
## [133] SummarizedExperiment_1.37.0 ROCR_1.0-11                
## [135] memoise_2.0.1               bslib_0.8.0                
## [137] bit_4.5.0

References

Marchand, Virginie, Lilia Ayadi, Felix G. M. Ernst, Jasmin Hertler, Valérie Bourguignon-Igel, Adeline Galvanin, Annika Kotter, Mark Helm, Denis L. J. Lafontaine, and Yuri Motorin. 2018. “AlkAniline-Seq: Profiling of m7G and m3C RNA Modifications at Single Nucleotide Resolution.” Angewandte Chemie International Edition 57 (51): 16785–90. https://doi.org/10.1002/anie.201810946.