RNAmodR: creating classes for additional modification detection from high throughput sequencing.

Introduction

For users interested in the general aspect of any RNAmodR based package please have a look at the main vignette of the package.

This vignette is aimed at developers and researchers, who want to use the functionality of the RNAmodR package to develop a new modification strategy based on high throughput sequencing data.

library(RNAmodR)

Two classes have to be considered to establish a new analysis pipeline using RNAmodR. These are the SequenceData and the Modifier class.

A new SequenceData class

First, the SequenceData class has to be considered. Several classes are already implemented, which are:

  • End5SequenceData
  • End3SequenceData
  • EndSequenceData
  • ProtectedEndSequenceData
  • CoverageSequenceData
  • PileupSequenceData
  • NormEnd5SequenceData
  • NormEnd3SequenceData

If these cannot be reused, a new class can be implemented quite easily. First the DataFrame class, the Data class and a constructor has to defined. The only value, which has to be provided, is a default minQuality integer value and some basic information.

setClass(Class = "ExampleSequenceDataFrame",
         contains = "SequenceDFrame")
ExampleSequenceDataFrame <- function(df, ranges, sequence, replicate,
                                      condition, bamfiles, seqinfo){
  RNAmodR:::.SequenceDataFrame("Example",df, ranges, sequence, replicate,
                               condition, bamfiles, seqinfo)
}
setClass(Class = "ExampleSequenceData",
         contains = "SequenceData",
         slots = c(unlistData = "ExampleSequenceDataFrame"),
         prototype = list(unlistData = ExampleSequenceDataFrame(),
                          unlistType = "ExampleSequenceDataFrame",
                          minQuality = 5L,
                          dataDescription = "Example data"))
ExampleSequenceData <- function(bamfiles, annotation, sequences, seqinfo, ...){
  RNAmodR:::SequenceData("Example", bamfiles = bamfiles, 
                         annotation = annotation, sequences = sequences,
                         seqinfo = seqinfo, ...)
}

Second, the getData function has to be implemented. This is used to load the data from a bam file and must return a named list IntegerList, NumericList or CompressedSplitDataFrameList per file.

setMethod("getData",
          signature = c(x = "ExampleSequenceData",
                        bamfiles = "BamFileList",
                        grl = "GRangesList",
                        sequences = "XStringSet",
                        param = "ScanBamParam"),
          definition = function(x, bamfiles, grl, sequences, param, args){
            ###
          }
)

Third, the aggregate function has to be implemented. This function is used to aggregate data over replicates for all or one of the conditions. The resulting data is passed on to the Modifier class.

setMethod("aggregateData",
          signature = c(x = "ExampleSequenceData"),
          function(x, condition = c("Both","Treated","Control")){
            ###
          }
)

A new Modifier class

A new Modifier class is probably the main class, which needs to be implemented. Three variable have to be set. mod must be a single element from the Modstrings::shortName(Modstrings::ModRNAString()). score is the default score, which is used for several function. A column with this name should be returned from the aggregate function. dataType defines the SequenceData class to be used. dataType can contain multiple names of a SequenceData class, which are then combined to form a SequenceDataSet.

setClass("ModExample",
         contains = c("RNAModifier"),
         prototype = list(mod = "X",
                          score = "score",
                          dataType = "ExampleSequenceData"))
ModExample <- function(x, annotation, sequences, seqinfo, ...){
  RNAmodR:::Modifier("ModExample", x = x, annotation = annotation,
                     sequences = sequences, seqinfo = seqinfo, ...)
}

dataType can also be a list of character vectors, which leads then to the creation of SequenceDataList. However, for now this is a hypothetical case and should only be used, if the detection of a modification requires bam files from two or more different methods to be used to detect one modification.

The settings<- function can be amended to save specifc settings ( .norm_example_args must be defined seperatly to normalize input arguments in any way one sees fit).

setReplaceMethod(f = "settings", 
                 signature = signature(x = "ModExample"),
                 definition = function(x, value){
                   x <- callNextMethod()
                   # validate special setting here
                   x@settings[names(value)] <- unname(.norm_example_args(value))
                   x
                 })

The aggregateData function is used to take the aggregated data from the SequenceData object and to calculate the specific scores, which are then stored in the aggregate slot.

setMethod(f = "aggregateData", 
          signature = signature(x = "ModExample"),
          definition = 
            function(x, force = FALSE){
              # Some data with element per transcript
            }
)

The findMod function takes the aggregate data and searches for modifications, which are then returned as a GRanges object and stored in the modifications slot.

setMethod("findMod",
          signature = c(x = "ModExample"),
          function(x){
            # an element per modification found.
          }
)

A new ModifierSet class

The ModifierSet class is implemented very easily by defining the class and the constructor. The functionality is defined by the Modifier class.

setClass("ModSetExample",
         contains = "ModifierSet",
         prototype = list(elementType = "ModExample"))
ModSetExample <- function(x, annotation, sequences, seqinfo, ...){
  RNAmodR:::ModifierSet("ModExample", x = x, annotation = annotation,
                        sequences = sequences, seqinfo = seqinfo, ...)
}

Visualization functions

Additional functions, which need to be implemented, are getDataTrack for the new SequenceData and new Modifier classes and plotData/plotDataByCoord for the new Modifier and ModifierSet classes. name defines a transcript name found in names(ranges(x)) and type is the data type typically found as a column in the aggregate slot.

setMethod(
  f = "getDataTrack",
  signature = signature(x = "ExampleSequenceData"),
  definition = function(x, name, ...) {
    ###
  }
)
setMethod(
  f = "getDataTrack",
  signature = signature(x = "ModExample"),
  definition = function(x, name, type, ...) {
  }
)
setMethod(
  f = "plotDataByCoord",
  signature = signature(x = "ModExample", coord = "GRanges"),
  definition = function(x, coord, type = "score", window.size = 15L, ...) {
  }
)
setMethod(
  f = "plotData",
  signature = signature(x = "ModExample"),
  definition = function(x, name, from, to, type = "score", ...) {
  }
)
setMethod(
  f = "plotDataByCoord",
  signature = signature(x = "ModSetExample", coord = "GRanges"),
  definition = function(x, coord, type = "score", window.size = 15L, ...) {
  }
)
setMethod(
  f = "plotData",
  signature = signature(x = "ModSetExample"),
  definition = function(x, name, from, to, type = "score", ...) {
  }
)

If unsure, how to modify these functions, have a look a the code in the Modifier-Inosine-viz.R file of this package.

Summary

As suggested directly above, for a more detailed example have a look at the ModInosine class source code found in the Modifier-Inosine-class.R and Modifier-Inosine-viz.R files of this package.

Sessioninfo

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RNAmodR_1.21.0           Modstrings_1.23.0        RNAmodR.Data_1.19.0     
##  [4] ExperimentHubData_1.33.0 AnnotationHubData_1.37.0 futile.logger_1.4.3     
##  [7] ExperimentHub_2.15.0     AnnotationHub_3.15.0     BiocFileCache_2.15.0    
## [10] dbplyr_2.5.0             txdbmaker_1.2.0          GenomicFeatures_1.59.0  
## [13] AnnotationDbi_1.69.0     Biobase_2.67.0           Rsamtools_2.22.0        
## [16] Biostrings_2.75.0        XVector_0.46.0           rtracklayer_1.66.0      
## [19] GenomicRanges_1.59.0     GenomeInfoDb_1.43.0      IRanges_2.41.0          
## [22] S4Vectors_0.44.0         BiocGenerics_0.53.0      BiocStyle_2.35.0        
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
##   [3] sys_3.4.3                   jsonlite_1.8.9             
##   [5] magrittr_2.0.3              farver_2.1.2               
##   [7] rmarkdown_2.28              BiocIO_1.17.0              
##   [9] zlibbioc_1.52.0             vctrs_0.6.5                
##  [11] ROCR_1.0-11                 memoise_2.0.1              
##  [13] RCurl_1.98-1.16             base64enc_0.1-3            
##  [15] htmltools_0.5.8.1           S4Arrays_1.6.0             
##  [17] BiocBaseUtils_1.9.0         progress_1.2.3             
##  [19] lambda.r_1.2.4              curl_5.2.3                 
##  [21] SparseArray_1.6.0           Formula_1.2-5              
##  [23] sass_0.4.9                  bslib_0.8.0                
##  [25] htmlwidgets_1.6.4           plyr_1.8.9                 
##  [27] Gviz_1.51.0                 httr2_1.0.5                
##  [29] futile.options_1.0.1        cachem_1.1.0               
##  [31] buildtools_1.0.0            GenomicAlignments_1.43.0   
##  [33] mime_0.12                   lifecycle_1.0.4            
##  [35] pkgconfig_2.0.3             Matrix_1.7-1               
##  [37] R6_2.5.1                    fastmap_1.2.0              
##  [39] GenomeInfoDbData_1.2.13     BiocCheck_1.43.0           
##  [41] MatrixGenerics_1.19.0       digest_0.6.37              
##  [43] colorspace_2.1-1            OrganismDbi_1.49.0         
##  [45] Hmisc_5.2-0                 RSQLite_2.3.7              
##  [47] labeling_0.4.3              filelock_1.0.3             
##  [49] colorRamps_2.3.4            fansi_1.0.6                
##  [51] httr_1.4.7                  abind_1.4-8                
##  [53] compiler_4.4.1              withr_3.0.2                
##  [55] bit64_4.5.2                 backports_1.5.0            
##  [57] htmlTable_2.4.3             biocViews_1.75.0           
##  [59] BiocParallel_1.41.0         DBI_1.2.3                  
##  [61] highr_0.11                  biomaRt_2.63.0             
##  [63] rappdirs_0.3.3              DelayedArray_0.33.1        
##  [65] rjson_0.2.23                tools_4.4.1                
##  [67] foreign_0.8-87              nnet_7.3-19                
##  [69] glue_1.8.0                  restfulr_0.0.15            
##  [71] grid_4.4.1                  stringdist_0.9.12          
##  [73] checkmate_2.3.2             reshape2_1.4.4             
##  [75] cluster_2.1.6               generics_0.1.3             
##  [77] gtable_0.3.6                BSgenome_1.75.0            
##  [79] ensembldb_2.31.0            data.table_1.16.2          
##  [81] hms_1.1.3                   xml2_1.3.6                 
##  [83] utf8_1.2.4                  BiocVersion_3.21.1         
##  [85] pillar_1.9.0                stringr_1.5.1              
##  [87] dplyr_1.1.4                 lattice_0.22-6             
##  [89] deldir_2.0-4                bit_4.5.0                  
##  [91] biovizBase_1.55.0           tidyselect_1.2.1           
##  [93] RBGL_1.82.0                 maketools_1.3.1            
##  [95] knitr_1.48                  gridExtra_2.3              
##  [97] ProtGenerics_1.38.0         SummarizedExperiment_1.36.0
##  [99] xfun_0.48                   matrixStats_1.4.1          
## [101] stringi_1.8.4               UCSC.utils_1.2.0           
## [103] lazyeval_0.2.2              yaml_2.3.10                
## [105] evaluate_1.0.1              codetools_0.2-20           
## [107] interp_1.1-6                tibble_3.2.1               
## [109] BiocManager_1.30.25         graph_1.85.0               
## [111] cli_3.6.3                   rpart_4.1.23               
## [113] munsell_0.5.1               jquerylib_0.1.4            
## [115] Rcpp_1.0.13                 dichromat_2.0-0.1          
## [117] png_0.1-8                   XML_3.99-0.17              
## [119] RUnit_0.4.33                parallel_4.4.1             
## [121] ggplot2_3.5.1               blob_1.2.4                 
## [123] prettyunits_1.2.0           jpeg_0.1-10                
## [125] latticeExtra_0.6-30         AnnotationFilter_1.31.0    
## [127] AnnotationForge_1.49.0      bitops_1.0-9               
## [129] VariantAnnotation_1.52.0    scales_1.3.0               
## [131] purrr_1.0.2                 crayon_1.5.3               
## [133] rlang_1.1.4                 KEGGREST_1.47.0            
## [135] formatR_1.14